Intro to Multi-document RAG Chatbots

Have you ever wondered if chatbot technology is a recent development? Or if ChatGPT was the first chatbot ever created? How can you build one yourself? If these questions have crossed your mind like they did mine, then this article will help clear up your doubts and guide you through hands-on coding to create your own chatbot.

According to Codecademy's article on the history of chatbots (https://www.codecademy.com/article/history-of-chatbots), chatbots have been around for quite some time. The article does a great job of explaining how chatbots have evolved from basic pattern recognition systems to the advanced tools like ChatGPT and Gemini that we use today.

This article will serve as a roadmap for anyone new to chatbot technology and feeling unsure of where to begin, just like I once was. You can follow along step-by-step, as everything in this article has been tried and tested by me.

Before we get started with development, there are some terms that come up whenever you talk about chatbots.

Types of Chatbots

There are two major types of chatbots that I have come across:

Rule-Based Chatbots: These chatbots respond to questions based on specific rules set by the programmer. The conversations are pre-defined, which means the chatbot can only answer questions that fall within these rules. If a user asks something outside of those rules, the chatbot won't be able to respond.
Open-Ended Chatbots: These are a solution to the limitations of rule-based chatbots. For example, ChatGPT can answer a wide range of questions without being limited by specific rules. You can ask it anything on any topic, and it will try to respond. These chatbots aren't based on hard-coded knowledge but instead learn from a broad set of information.
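To make the rule-based idea concrete, here is a minimal sketch in Python. The rules, the replies, and the function name rule_based_reply are all made up for illustration; a real rule-based bot would have many more rules, but the limitation is the same: anything outside the rules falls through to a fallback message.

```python
import re

# Hypothetical rules: each pattern is paired with a canned reply.
# The replies are invented examples, not real university information.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hello! How can I help you?"),
    (re.compile(r"\blibrary\b.*\b(hours|open)\b", re.IGNORECASE), "The library is open from 9 am to 5 pm."),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Goodbye!"),
]

FALLBACK = "Sorry, I don't know how to answer that."


def rule_based_reply(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    for pattern, reply in RULES:
        if pattern.search(message):
            return reply
    # No rule matched: the bot cannot answer.
    return FALLBACK


print(rule_based_reply("Hello there"))
print(rule_based_reply("When is the library open?"))
print(rule_based_reply("What is on the syllabus?"))
```

The last question is perfectly reasonable, yet the bot can only return the fallback because no rule covers it. That gap is what open-ended chatbots try to close.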

There are many technologies available to create both rule-based and open-ended chatbots. In this project, I will focus on developing an open-ended chatbot using organization-specific data that I will provide.

The problem I am solving using this chatbot

In the tech talks I’ve attended, a common question people often ask is, “With so many technologies available, which one should I learn?” This question came up during the Google Girls Hackathon '24 training session as well, where a senior developer advised, “First, identify the problem you want to solve, then choose the technology that can address it.”

With this approach in mind, I decided to tackle a problem within my university environment. From my experiences at college, I’ve seen many students struggle when their questions go unanswered or when they receive conflicting answers from different people. Additionally, important information is sometimes not shared promptly. I'm not suggesting that the university can't provide this information; rather, there are too many channels for information, which can confuse students about which sources to trust.

By developing this conversational AI, I aim to improve the experience of getting answers and clarifying doubts. I want to ensure that all students have quick and easy access to accurate and up-to-date information.

My goal in creating this chatbot is to make essential information accessible quickly and efficiently, helping students navigate their university experience with ease.

What is the chatbot designed to do?

I wanted the chatbot to respond to questions about my university. If you ask ChatGPT what's happening at your university, you might get a response like this: (I’ve asked many questions about my university, so that might be why it recognizes the name :) )

I came across two terms related to solving this problem: Retrieval Augmented Generation (RAG) and Fine-tuning. Since I found these terms closely related and easy to confuse, I want to clarify what they mean:

Retrieval Augmented Generation (RAG): Simply put, RAG is a technique that lets Large Language Models (LLMs) answer questions using specific knowledge that they weren't originally trained on. LLMs are trained on large amounts of data and can generate responses for tasks like answering questions, translating languages, and completing sentences. With RAG, passages relevant to the user's question are retrieved from an external source, such as an organization’s internal documents, and passed to the LLM along with the question, so the model does not need to be retrained. This makes it a cost-effective way to ensure the LLM’s responses are relevant, accurate, and useful.
Fine-tuning: According to IBM, fine-tuning involves providing the model with labeled data specific to the application, such as typical questions and the correct answers in the desired format. For instance, if a team is developing a customer service chatbot, they would compile many documents with customer service questions and their correct answers, and then use those documents to train the model.

Now that you understand these terms, you can see why I refer to it as a RAG chatbot. Essentially, I am making the LLM answer questions based on additional documents I provide. While fine-tuning can be labor-intensive and often requires outsourcing to companies that specialize in data labeling, RAG is something I can start using right now. So, I see it as a path from RAG to fine-tuning, if needed.
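Here is a minimal sketch of the RAG flow described above, using only the Python standard library. It is an illustration under assumptions, not the implementation from my project: the documents are made-up examples, simple word-count similarity stands in for the embedding-based search that real RAG systems usually use, and the final prompt is only built, not sent to an LLM. The names retrieve and build_prompt are hypothetical helpers.

```python
import math
import re
from collections import Counter

# Made-up example documents standing in for the organization's data.
DOCUMENTS = [
    "The data structures course covers arrays, linked lists, trees and graphs.",
    "The operating systems course covers processes, threads and memory management.",
    "The hostel office handles room allocation and maintenance requests.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Retrieval step: return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Augmentation step: put the retrieved text in front of the question."""
    joined = "\n".join(context)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}"
    )


question = "Which course covers threads and memory management?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS))
print(prompt)  # In a real chatbot, this prompt would be sent to the LLM.
```

Notice that the model itself is never changed. Only the prompt changes, which is why I could start with RAG right away.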

Prerequisites

You should have a solid grasp of Python, especially working with files, lists, dictionaries, and tuples, along with the basic programming concepts common to any language.
You should also be familiar with machine learning concepts—understanding data, models, and the general process of training machine learning models.
Since chatbots involve machines interacting with humans, it's important to have a strong understanding of natural language processing (NLP). You should know NLP concepts, pipelines, and the role each component plays in the process.
The sources I've used to learn these concepts are listed in the "Resources I have used" section at the end of this article.

Once you're comfortable with these topics, we can start planning!

Steps involved in building the chatbot

Scope of the work (SOW):

The Scope of Work defines what features I plan to implement initially, as it's not feasible to develop everything at once. Defining the scope helps focus on the essential features that provide value right from the start.

According to Codebasics (link provided below), one effective way to prioritize features is by identifying those that have high impact and high feasibility, meaning they can significantly improve the user experience and are easy to implement.

Minimum Viable Product (MVP):
The MVP is the core version of the product that includes just enough features to allow early testing and feedback. It’s important to focus on delivering value quickly and avoiding the temptation to add too many features at the start.

In my case, I decided to narrow down the scope by focusing on specific areas. For example, I started by covering the syllabus of certain courses or sticking to answering general FAQs rather than tackling everything at once. This approach helps ensure steady progress and reduces complexity.

Solution Design and Architecture:
As the title suggests, this involves identifying the technologies, tools, and frameworks we will use to build the chatbot. There are numerous chatbot frameworks available online, both paid and free, that can simplify the development process.

However, since I wanted to learn the underlying technology from the ground up, I decided to build it from scratch. The technologies and frameworks I’ve chosen will be detailed in part 2, where we’ll dive into the actual coding.

Gathering the Data:
In my case, the data consists of the documents that the chatbot will reference. For now, it is limited to the syllabus of specific courses. Gathering and organizing this data is a crucial step since the chatbot will rely on these documents to answer user queries accurately.
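As a rough sketch of what organizing the data can look like, the snippet below reads plain-text files from a folder and splits each one into overlapping chunks that a retriever could search later. Everything here is an assumption for illustration: the helper names load_documents and split_into_chunks, the .txt format, the chunk sizes, and the sample syllabus text are all hypothetical, and the approach I actually use in part 2 may differ.

```python
import tempfile
from pathlib import Path


def load_documents(folder: Path) -> dict[str, str]:
    """Read every .txt file in the folder into a {filename: text} mapping."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(folder.glob("*.txt"))
    }


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows where neighbouring chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Demo with a made-up syllabus file written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    sample = "Unit 1: Arrays and linked lists. Unit 2: Trees and graphs. " * 10
    (folder / "data_structures_syllabus.txt").write_text(sample, encoding="utf-8")

    documents = load_documents(folder)
    for name, text in documents.items():
        chunks = split_into_chunks(text)
        print(f"{name}: {len(text)} characters -> {len(chunks)} chunks")
```

Splitting by character count is the simplest option. Splitting on sentences or headings usually keeps related information together better, at the cost of a little more code.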

Start building

We can now start building our chatbot from the ground up. While there are plenty of online resources to refer to, in this article, I'm sharing the ones that helped me learn.

Resources I have used

I get how frustrating it can be to sit through a 3-hour YouTube video, only to realize there are hundreds of similar ones online. It can be overwhelming to pick the right resource from so many options. To help make things easier, I’ve listed the resources that personally helped me get started on the project. I'm not saying other resources aren't good or that you shouldn't check out other articles, videos, or documentation—these are just the ones I found useful and liked.

NLP tutorial python playlist: https://www.youtube.com/playlist?list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX

Understanding how RAG works: https://www.youtube.com/watch?v=TRjq7t2Ms5I

Langchain Masterclass: https://www.youtube.com/watch?v=yF9kGESAi3M&t=9289s

Langchain playlist: https://www.youtube.com/watch?v=tEL833CPhqw&list=PLTDARY42LDV6flFgQLJCcVSXXa58mZ9Ty

Langchain multi pdf tutorial: https://www.youtube.com/watch?v=dXxQ0LR-3Hg&t=3226s

Multi-document RAG chatbot using Streamlit: https://www.youtube.com/watch?v=3ZDVmzlM6Nc

I'm learning as I go, so if you spot anything I could improve, feel free to share your feedback. ✨

🤖 Here is the GitHub link to my project: RAG_Chatbot