Multi-document RAG Chatbot - Part 2

Welcome to the second part of building a RAG-based multi-document chatbot. By now, you should have some idea of what the chatbot is and how I’m approaching the project. Now, it's time to dive deeper.

Before we continue, I want to clarify a few things:

I’m not training any models here. Why? Because training a large language model from scratch costs an enormous amount of money, often running into the millions of dollars.

I’m also not fine-tuning a model, as this requires a large labeled dataset, which is beyond the scope of this project.

I’m not teaching the model anything new.

So, what am I doing?

I’m simply providing the right information to the language model. My role is to organize and supply relevant chunks of data, allowing the model to generate accurate answers.

Tech Terminology

LangChain: LangChain is an open-source framework that helps developers build applications using large language models (LLMs).

Why did I choose this framework?

I chose LangChain because there's a large, active community using it. If you run into any issues, there are plenty of people who can offer support. While I'd love to explore other frameworks in the future, as a beginner I found LangChain to be quite popular, which made it an easy choice for me.

How did I learn it?

I learned LangChain through YouTube videos and the official documentation. LangChain offers a lot of functionality, and since you can’t learn everything at once, I’m picking things up as I go along. This was the first time I thoroughly read the official documentation, and it turned out to be really helpful, so I recommend going through it regularly as you progress.

LangSmith: LangSmith is a platform that allows you to debug, evaluate, analyze, and monitor your LLM-powered applications. I use it a lot to track everything, from the questions I ask the chatbot to its responses, the time it takes to respond, token usage, and more (I’ll explain more about tokens later in the article).
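For anyone curious, LangSmith tracing doesn’t require changing the chains themselves; it is typically switched on through environment variables before the app runs. Here is a minimal sketch, where the API key and project name are placeholders you’d replace with your own:

```python
import os

# Switch on LangSmith tracing for everything LangChain runs.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # placeholder
os.environ["LANGCHAIN_PROJECT"] = "rag-chatbot"  # traces are grouped by project
```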

Below is a snapshot of my LangSmith account, showing how it tracks various stats for my project.

Let’s learn about RAG a little deeper before we proceed further.

Retrieval Augmented Generation (RAG)

As explained in Part 1 of this article, RAG (Retrieval-Augmented Generation) is a method that helps Large Language Models (LLMs) answer questions based on specific information they weren't originally trained on. Let’s use an example to understand how RAG works.

Imagine we are chatbot developers, and a large automobile company approaches us. They want us to create a system to help their new employees easily learn the basics of all the automobile parts. They have a huge collection of documents for every vehicle they've ever built, including detailed information about each model. Since ChatGPT is trending, they want their own custom chatbot that can answer any questions related to these complex automobile documents. The first solution that comes to mind is RAG. Why? We already know the answer.

So, how do we explain this to the company owner in simple terms?

"We’ll take your custom automobile documents, feed them into a model, and it will analyze the content. After that, you can ask it any question you want about the documents. Simple!"

RAG Pipeline

This whole process of taking the documents, analyzing them, storing them somewhere, and then generating answers is called the RAG pipeline in technical terms. (A minimal code sketch of the pipeline follows the list below.)

  • Load: This step involves loading the documents into the model. These documents will serve as the knowledge base for the chatbot.

  • Split: The documents are then broken down into smaller, manageable chunks. This is done because each model can only process a limited number of tokens (small pieces of text, roughly words or parts of words) at a time. The token limit varies across different language models, so the splitting is adjusted accordingly.

  • Embed: Since language models don’t understand plain text, we need to convert these chunks into a format they can process, called vectors. This is done using embedding models, which turn the text into vector representations.

  • Store: The converted vectors are stored in specialized databases known as vector stores. There are several options for vector stores; in this project, I’ve used ChromaDB.
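To make these steps concrete, here is a minimal sketch of the ingestion side using LangChain and ChromaDB. The file path, chunk sizes, and the choice of OpenAI embeddings are illustrative assumptions, and the exact import paths vary between LangChain releases:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load: read a document into LangChain Document objects (path is a placeholder)
docs = PyPDFLoader("manuals/engine_basics.pdf").load()

# Split: break the documents into overlapping chunks that fit the token limit
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed + Store: turn each chunk into a vector and persist it in ChromaDB
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)
```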

The RAG pipeline outlines the steps involved in developing a RAG application. Now the question arises: how does the chatbot provide answers when a user asks a question? Let’s explore what happens next.

RAG Stack

I found this amazing diagram in the video Chat with multiple PDFs, and it gave me a very clear picture of how everything works inside a chatbot. Below is the same diagram with a few little additions of my own:

Data Ingestion: This stage refers to the process I explained earlier in the RAG pipeline. It involves taking documents, splitting them into smaller chunks, converting them into vectors, and storing them in vector databases. This entire process is called data ingestion.

Retrieval and Synthesis: From the diagram, you can see that when a user asks a question, the question is also transformed into embeddings. We then perform similarity searches to find the embeddings that closely match those in the vector stores. The retrieved document embeddings are ranked based on how similar they are to the question embeddings. Finally, these ranked results are sent to the appropriate language model (LLM), which generates a final answer and presents it to the user.
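Here is a rough sketch of that retrieval-and-synthesis flow, continuing from the vectorstore built in the ingestion sketch above. The question, the value of k, and the choice of model are illustrative assumptions:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

question = "How does the fuel injection system work?"  # example user question

# Retrieve: embed the question and fetch the k most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
relevant_chunks = retriever.invoke(question)

# Synthesize: hand the retrieved chunks to the LLM as context for the answer
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo")
context = "\n\n".join(doc.page_content for doc in relevant_chunks)
answer = llm.invoke(prompt.format(context=context, question=question))
print(answer.content)
```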

Challenges with the “naive” RAG

So, is it as simple as feeding the documents to the system, and when the user asks a question, it just gives the correct answer? Not at all! There are several challenges when building RAG (Retrieval-Augmented Generation) applications. Below, I’ve listed some of the issues I personally encountered while developing this chatbot. There may be more challenges out there, but I’ll focus on the ones I’ve faced.

USER: “Hey AI, can you give me a list of all the events happening in my college this month?”

AI: “Your exams are coming up this month.” :)

It’s frustrating to get responses like this because it’s not what you were looking for. This happens due to the issues below (a quick way to check what was actually retrieved is sketched after the list):

  • Bad Retrieval:

    • The chunks retrieved are not relevant to the question, meaning the system failed to get the correct information.

    • Sometimes, all relevant chunks are retrieved, but they don’t provide enough context for the model to generate a proper answer.

    • Some information may be outdated.

  • Bad Response Generation:

    • Hallucination: This term comes up often when dealing with chatbots. A model "hallucinates" when it generates an answer that isn’t based on the provided documents. I’ve experienced this issue frequently. In future posts, I’ll share how I overcame these problems.

    • Irrelevance: When the model gives an answer that doesn’t actually respond to the question.

    • Toxicity/Bias: Sometimes, the model may generate harmful or offensive responses.
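When debugging problems like these, the first thing I check is whether the right chunks were retrieved at all. Here is a small diagnostic sketch, reusing the vectorstore from the ingestion example; the query and k are placeholders:

```python
# Print the chunks the vector store retrieves for a failing question,
# along with their similarity scores, to separate bad retrieval from
# bad response generation.
query = "List all the events happening in my college this month"
results = vectorstore.similarity_search_with_score(query, k=4)

for doc, score in results:
    # For ChromaDB the score is a distance: lower means more similar.
    print(f"score={score:.3f} | {doc.page_content[:120]}...")
```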


Resources I have used

I get how frustrating it can be to sit through a 3-hour YouTube video, only to realize there are hundreds of similar ones online. It can be overwhelming to pick the right resource from so many options. To help make things easier, I’ve listed the resources that personally helped me get started on the project. I'm not saying other resources aren't good or that you shouldn't check out other articles, videos, or documentation; these are just the ones I found useful and liked.

NLP tutorial Python playlist: youtube.com/playlist?list=PLeo1K3hjS3uuvuAX..

Understanding how RAG works: youtube.com/watch?v=TRjq7t2Ms5I

LangChain Masterclass: youtube.com/watch?v=yF9kGESAi3M&t=9289s

LangChain playlist: youtube.com/watch?v=tEL833CPhqw&list=PL..

LangChain multi-PDF tutorial: youtube.com/watch?v=dXxQ0LR-3Hg&t=3226s

Multi-document RAG chatbot using Streamlit: youtube.com/watch?v=3ZDVmzlM6Nc

I'm learning as I go, so if you spot anything I could improve, feel free to share your feedback.

🤖 Here is the GitHub link to my project: RAG_Chatbot