Building efficient Retrieval-Augmented Generation (RAG) chatbots is a tough problem, especially when you’re dealing with complex, dynamic data sources such as large websites, PDFs with embedded images, and mixed-format tables.
Traditional RAG pipelines often face bottlenecks in:
Long ingestion times during data scraping and processing,
Irrelevant retrievals due to noisy chunks,
High latency and token costs during query response.
In this session, we’ll explore a real-world implementation of a RAG-based chatbot that overcame these performance issues through four key innovations:
Accelerated Ingestion (Asyncio):
By building an asynchronous scraping layer using asyncio, we reduced total website crawl time from 5 hours 18 minutes to 40 minutes, roughly an 8× speedup in ingestion.
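The shape of that asynchronous layer can be sketched as below. This is a minimal, self-contained illustration: the URL list and `fetch` body are placeholders (a real crawler would issue HTTP requests, e.g. with aiohttp), and the 0.1 s sleep stands in for network latency. The key ideas, fanning out requests with `asyncio.gather` and bounding concurrency with a semaphore so the target server is not overwhelmed, carry over directly.

```python
import asyncio

# Hypothetical URL list; in a real pipeline these would come from the site's sitemap.
URLS = [f"https://example.com/page/{i}" for i in range(20)]

async def fetch(url: str) -> str:
    # Placeholder for an HTTP GET; the 0.1 s sleep simulates network latency.
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"

async def crawl(urls, max_concurrency: int = 10) -> list:
    # Bound concurrency so we don't overwhelm the target server.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order, so results line up with URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))

pages = asyncio.run(crawl(URLS))
```

With 20 simulated pages and a concurrency of 10, the crawl completes in about two sleep intervals instead of twenty, which is the same effect that collapsed hours of sequential scraping into minutes.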
Multi-Modal Chunking:
We developed a chunking pipeline that intelligently processes text, images, and tables, preserving contextual relationships to improve the embedding and retrieval accuracy.
Hybrid Re-Ranking:
Instead of relying solely on semantic similarity, our re-ranking model blends semantic relevance with metadata factors (page authority, depth, and domain weight) to surface the most relevant snippets.
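Concretely, the blend can be written as a weighted sum over normalized signals. The weights below (α = 0.7, β = 0.2, γ = 0.1) are illustrative placeholders to be tuned on an evaluation set, not the values used in the talk:

```python
def hybrid_score(semantic, page_weight, domain_authority,
                 alpha=0.7, beta=0.2, gamma=0.1):
    # All three signals are assumed pre-normalized to [0, 1].
    return alpha * semantic + beta * page_weight + gamma * domain_authority

def rerank(candidates):
    # candidates: list of dicts carrying per-signal scores for each snippet
    return sorted(
        candidates,
        key=lambda c: hybrid_score(
            c["semantic"], c["page_weight"], c["domain_authority"]
        ),
        reverse=True,
    )
```

The design point is that a snippet from a high-authority, shallow page can outrank a marginally more similar snippet buried deep in a low-value section of the site.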
Optimized Query Workflow:
Incoming user queries are classified (general, gibberish, or site-specific) and rephrased to maximize recall during the retrieval phase, all while reducing token usage and response latency.
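A toy version of that classify-and-rephrase step is sketched below. The heuristics and keyword list are stand-ins (a production system might use a small LLM call or a trained classifier); the point is that gibberish can be rejected before any retrieval happens, and terse queries can be expanded before embedding:

```python
import re

def classify_query(q: str) -> str:
    # Crude heuristics standing in for a learned or LLM-based classifier.
    q = q.strip()
    if len(q) < 3 or not re.search(r"[aeiou]", q.lower()):
        return "gibberish"
    # Hypothetical site-specific trigger words; a real list would be site-derived.
    if re.search(r"\b(page|site|docs?|pricing|product)\b", q.lower()):
        return "site-specific"
    return "general"

def rephrase(q: str) -> str:
    # Expand very short queries into a fuller retrieval query; a real system
    # might do this with a few-shot LLM prompt instead.
    q = q.strip().rstrip("?")
    return q if len(q.split()) > 4 else f"information about {q}"
```

Skipping retrieval entirely for gibberish, and answering general queries without pulling site chunks, is where most of the token and latency savings come from.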
This end-to-end pipeline produced a chatbot that was significantly faster, more cost-efficient, and contextually precise, making it well suited to real-world, web-scale RAG systems.
The talk will walk attendees through:
Architecting an async data ingestion pipeline for large-scale websites,
Implementing multi-modal preprocessing and chunking,
Designing a hybrid scoring function (α·semantic + β·page weight + γ·domain authority) for re-ranking,
Integrating the pipeline into a retrieval and response generation workflow with LLMs.
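The four pieces above slot into a single retrieve-then-generate loop. The sketch below wires them together under heavy simplifying assumptions: `embed` is a toy character-histogram embedding standing in for a real embedding model, and the `llm` argument is any callable that turns a prompt into an answer, so a real LLM client can be dropped in:

```python
def embed(text: str) -> list:
    # Toy embedding: normalized letter histogram; swap in a real model in practice.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def answer(query: str, corpus: list, llm, k: int = 2) -> str:
    # Retrieve the top-k chunks, then hand them to the LLM as grounding context.
    qv = embed(query)
    ranked = sorted(corpus, key=lambda c: cosine(qv, embed(c)), reverse=True)
    context = "\n---\n".join(ranked[:k])
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

In the full pipeline, the query-classification step runs before `answer`, and `rerank`-style hybrid scoring replaces the plain cosine sort shown here.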
By the end, participants will gain a clear blueprint for applying similar optimizations to their own RAG chatbots, achieving both speed and quality without excessive token usage.
Key takeaways:
Build async data ingestion systems using Python’s asyncio for massive speed improvements.
Apply multi-modal chunking to preserve context across images, text, and tables.
Use hybrid scoring for more intelligent retrieval.
Optimize queries to reduce cost and latency while improving accuracy.