Building efficient Retrieval-Augmented Generation (RAG) chatbots is a tough problem, especially when you’re dealing with complex, dynamic data sources such as large websites, PDFs with embedded images, and mixed-format tables.
Traditional RAG pipelines often face bottlenecks in:
Long ingestion times during data scraping and processing,
Irrelevant retrievals due to noisy chunks,
High latency and token costs during query response.
In this session, we’ll explore a real-world implementation of a RAG-based chatbot that overcame these performance issues through four key innovations:
Accelerated Ingestion (Asyncio):
By building an asynchronous scraping layer with asyncio, we reduced total website crawl time from 5 hours 18 minutes to 40 minutes, roughly an 8× speedup in ingestion.
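The gain comes from overlapping network waits instead of fetching pages one at a time. Below is a minimal sketch of that pattern; the URL list, the `fetch` body (a simulated I/O wait standing in for a real async HTTP client), and the concurrency cap are all illustrative assumptions, not the talk's actual crawler.

```python
import asyncio

# Hypothetical page list; a real crawler would discover URLs as it goes.
URLS = [f"https://example.com/page/{i}" for i in range(20)]

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    # Simulated network I/O; swap in an async HTTP client in practice.
    async with sem:
        await asyncio.sleep(0.05)
        return f"<html>{url}</html>"

async def crawl(urls: list[str], concurrency: int = 10) -> list[str]:
    # The semaphore caps in-flight requests so the target site isn't flooded.
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

pages = asyncio.run(crawl(URLS))
```

Because the waits overlap, 20 fetches of 50 ms each finish in roughly 100 ms (two batches of 10) rather than a full second sequentially.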
Multi-Modal Chunking:
We developed a chunking pipeline that intelligently processes text, images, and tables, preserving contextual relationships to improve embedding and retrieval accuracy.
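One way to preserve those relationships is to keep image captions and table rows attached to the text block they appear under, rather than embedding them in isolation. The sketch below assumes an upstream parser that emits `(kind, payload)` blocks; the `Chunk` shape, block format, and size threshold are illustrative, not the talk's actual pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    images: list = field(default_factory=list)  # captions / alt text kept with surrounding prose
    tables: list = field(default_factory=list)  # flattened table rows kept with surrounding prose

def chunk_blocks(blocks, max_chars: int = 500) -> list[Chunk]:
    """Group parsed blocks so non-text elements stay with the text preceding them."""
    chunks, current = [], Chunk(text="")
    for kind, payload in blocks:
        if kind == "text":
            # Start a new chunk only on text boundaries, so attached media never splits off.
            if current.text and len(current.text) + len(payload) > max_chars:
                chunks.append(current)
                current = Chunk(text="")
            current.text += payload
        elif kind == "image":
            current.images.append(payload)
        elif kind == "table":
            current.tables.append(payload)
    if current.text or current.images or current.tables:
        chunks.append(current)
    return chunks

# Toy input: an image caption and a table stay in the same chunk as their context.
blocks = [
    ("text", "Intro paragraph. "),
    ("image", "fig1: system architecture"),
    ("text", "More detail. "),
    ("table", ["metric", "value"]),
]
chunks = chunk_blocks(blocks)
```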
Hybrid Re-Ranking:
Instead of relying solely on semantic similarity, our re-ranking model blends semantic relevance with metadata factors (page authority, depth, and domain weight) to surface the most relevant snippets.
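The blended score matches the α·semantic + β·page weight + γ·domain authority form described later in the outline. Here is a minimal sketch; the weight values, candidate fields, and example documents are assumptions chosen for illustration, not the tuned production values.

```python
def hybrid_score(semantic: float, page_weight: float, domain_authority: float,
                 alpha: float = 0.7, beta: float = 0.2, gamma: float = 0.1) -> float:
    # score = alpha * semantic + beta * page weight + gamma * domain authority
    return alpha * semantic + beta * page_weight + gamma * domain_authority

def rerank(candidates: list[dict], **weights) -> list[dict]:
    # Sort retrieved snippets by the blended score, best first.
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c["semantic"], c["page_weight"],
                                   c["domain_authority"], **weights),
        reverse=True,
    )

# Toy candidates: "b" is slightly less similar but comes from a
# heavily weighted, authoritative page, so it outranks "a".
candidates = [
    {"id": "a", "semantic": 0.9, "page_weight": 0.1, "domain_authority": 0.2},
    {"id": "b", "semantic": 0.8, "page_weight": 0.9, "domain_authority": 0.9},
]
ranked = rerank(candidates)
```

With the assumed weights, "a" scores 0.67 and "b" scores 0.83, so metadata can overturn a pure-similarity ordering.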
Optimized Query Workflow:
Incoming user queries are classified (general, gibberish, or site-specific) and rephrased to ensure optimal recall during the retrieval phase, all while reducing token usage and response latency.
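The classify-then-route step can be sketched as below. The keyword heuristics and string-level rephrasing are deliberate stand-ins: the actual pipeline presumably uses an LLM or trained classifier for both, but the control flow (drop gibberish before retrieval, rephrase the rest) is the same.

```python
import re

SITE_HINTS = {"page", "site", "docs", "pricing"}  # hypothetical site-specific cues

def classify_query(q: str) -> str:
    """Toy heuristic classifier standing in for an LLM/ML classifier."""
    tokens = q.split()
    if not tokens or not re.search(r"[aeiou]", q.lower()):
        return "gibberish"  # no vowels at all: likely keyboard mashing
    if any(t.lower() in SITE_HINTS for t in tokens):
        return "site-specific"
    return "general"

def route(q: str):
    label = classify_query(q)
    if label == "gibberish":
        # Skip retrieval and generation entirely: saves tokens and latency.
        return label, None
    # Placeholder rephrasing; a real system would call an LLM here.
    rephrased = q.strip().rstrip("?").lower()
    return label, rephrased
```

Usage: `route("What is the pricing page?")` returns the label `"site-specific"` plus a normalized query, while gibberish input short-circuits before any retrieval cost is incurred.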
This end-to-end pipeline produced a chatbot that was significantly faster, more cost-efficient, and contextually precise, making it ideal for real-world, web-scale RAG systems.
The talk will walk attendees through:
Architecting an async data ingestion pipeline for large-scale websites,
Implementing multi-modal preprocessing and chunking,
Designing a hybrid scoring function (α·semantic + β·page weight + γ·domain authority) for re-ranking,
Integrating the pipeline into a retrieval and response generation workflow with LLMs.
By the end, participants will gain a clear blueprint for applying similar optimizations to their own RAG chatbots, achieving both speed and quality without excessive token usage.