Retrieval Augmented Generation (RAG)
Key Takeaways
This video shows how to implement Retrieval Augmented Generation (RAG) in Python, using the Milvus vector database and OpenAI GPT.
Full Transcript
Hi everyone, welcome to another video from PyML Studio. This is our first video in 2025. In the last video we covered some techniques for enhancing LLM capabilities, including prompt engineering, LLM fine-tuning, and RAG. In this video we are going to implement a RAG system from scratch, so let's get to it.

Today we will start with a brief recap of RAG. Then we will look at vector databases, what they do, and a few popular options. Next we will see some of the main embedding models that we can use for building a RAG system, and finally we'll go over an implementation of a RAG system. This implementation will help in understanding the different components, what each component does, and how they work together in a RAG system.

As we mentioned in the previous video, LLMs acquire parametric knowledge during their training. This parametric knowledge is good for general use cases, but it has certain limitations. LLMs have a knowledge cutoff date and cannot answer questions about events after that cutoff. They are also not suitable for domain-specific applications that rely on proprietary information that is not publicly available. Finally, updating LLMs with new knowledge is not easy. To overcome these limitations we can use RAG, or retrieval augmented generation. Essentially, RAG is a technique that, given a query from the user, finds the most relevant information in a database and feeds that information to the LLM. The LLM then answers the user's question based on that information.

Vector databases are specialized databases designed for efficiently storing and searching through high-dimensional data. They essentially implement approximate nearest neighbor (ANN) algorithms, a class of algorithms for approximating k-nearest-neighbor search. Vector databases have applications in search and recommendation systems, and they are an essential component of a RAG system. Some popular vector databases are shown on this slide, including Pinecone, Weaviate, Milvus, FAISS (Facebook AI Similarity Search), and Chroma. Some of these provide cloud-only support, whereas others support both cloud and local deployment. In this video we will use Milvus, as it is open source and very easy to install and use locally.

Another important component of a RAG system is the embedding model. The embedding model generates high-dimensional vectors that capture the semantic meaning of the input text. Some popular choices are Sentence Transformers, BAAI General Embedding (BGE), and the models from OpenAI. You can see that the dimensionality of the embedding vectors generated by these models ranges from 384 to 3072. These vectors are stored in the vector databases we mentioned previously for fast search.

For this implementation we are going to use Sentence Transformers as our embedding model, and we store the text data along with their high-dimensional vectors in a Milvus database. For generating the final response in natural language we use the OpenAI GPT-4o model. So let's see the implementation step by step.

For this implementation we want to build a RAG system for the paper titled "VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents". You can download the paper from arXiv with the link in the description of this video. We use PyPDF2 to read the PDF file in Python page by page using its PDF reader, and then we concatenate the text into one long string. After that we do some cleaning: because the lines in the PDF end with \n, we remove the unnecessary newlines.

Next, I have defined a function called split_text to split the text into chunks, with a given chunk size and overlap as input arguments. This function returns a list of text chunks. For this example I used a chunk size of 2,000 and an overlap of 500 characters.

Now we are ready to generate the embeddings for each chunk. For this example we use Sentence Transformers with the model named all-MiniLM-L6-v2. We call model.encode on each chunk and get an embedding vector for it.

The next step is to store the chunks and their associated embeddings for lookup. As we said earlier, we use the Milvus vector database for this part. We import MilvusClient and create a database locally. Then we create a collection named for the VisRAG paper. If a collection with this name already exists in our database, we can drop it and recreate it. After that we can insert the chunks and their embeddings into the collection. First we reformat our data so that we have a list of dictionaries. Each dictionary contains an ID, which is a unique identifier for the chunk; the embedding vector, given as a Python list; and finally the actual text of the chunk. Then we insert the data into our collection.

Now we can test the vector search functionality. Let's give an example query: "In the VisRAG paper, or in VisRAG retrieval, how is the final embedding generated?" We use the same embedding model to generate the embedding vector for this query. Then we can find the chunks most similar to this query. This is done by comparing this embedding vector with the embeddings of the chunks stored in the database. This retrieves the two chunks that are most similar to the query.

Now we need to set up the OpenAI client. I've already created an OpenAI API key from my account and stored it in a .env file located in the same directory as this notebook. By importing the dotenv package I can load the OpenAI API key as an environment variable and then set up the OpenAI client as shown here.

Now we can generate a response using the OpenAI GPT-4o model. Our prompt contains the message "Answer the question about the VisRAG paper", followed by the query. For testing purposes, let's first leave out the retrieved chunks and see what response the GPT model gives us. When we call the OpenAI client's chat completions API with this message, we get this response. However, this response is completely unrelated to the paper; in fact, it is generated from GPT's parametric knowledge.

Now we call GPT-4o again with the same message and also include the retrieved chunks by appending them to the end of the message. Our prompt now asks GPT to answer the question based on the retrieved information, not its own parametric knowledge. And now we can see the correct response. This response is more relevant, and we can see the relevant paragraph from the paper that describes the process of calculating the embeddings.

Thanks everyone for watching. I hope this video was useful for understanding how RAG works. In the next video we will cover fine-tuning: we will describe different algorithms for fine-tuning LLMs and see how to use them. So, until the next video.
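The chunking step mentioned in the transcript (a chunk size of 2,000 characters with an overlap of 500) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function body, the argument names `chunk_size` and `overlap`, and the validation checks are assumptions; only the idea of fixed-size character chunks with overlap comes from the video.

```python
def split_text(text: str, chunk_size: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into character chunks of at most `chunk_size`.

    Consecutive chunks share `overlap` characters. This body is an
    assumed sketch; the notebook's own implementation may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny demonstration with small numbers so the overlap is easy to see.
demo_chunks = split_text("abcdefghij", chunk_size=4, overlap=2)
```

With the small demo values, each chunk repeats the last two characters of the previous one, which is the same effect the 500-character overlap has on the paper text: a sentence cut at a chunk boundary still appears whole in a neighboring chunk.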
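To make the vector search step more concrete, here is a small sketch of what "finding the most similar chunks" means, written with the standard library only. It is an exhaustive (exact) search over toy 3-dimensional vectors, whereas Milvus uses approximate nearest neighbor indexes over the real embeddings. The use of cosine similarity, the helper names `cosine_similarity` and `top_k`, and the toy records are all assumptions for illustration; the video does not state which similarity metric is used. The record layout (an ID, a vector as a Python list, and the chunk text) follows the description in the transcript.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], records: list[dict], k: int = 2) -> list[dict]:
    """Return the k records whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, r["vector"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]


# Toy data in the same shape as the dictionaries inserted into the collection.
records = [
    {"id": 0, "vector": [1.0, 0.0, 0.0], "text": "chunk about retrieval"},
    {"id": 1, "vector": [0.0, 1.0, 0.0], "text": "chunk about generation"},
    {"id": 2, "vector": [0.9, 0.1, 0.0], "text": "chunk about embeddings"},
]
query_vec = [1.0, 0.05, 0.0]  # stands in for the embedded user query
hits = top_k(query_vec, records, k=2)
hit_ids = [h["id"] for h in hits]
```

In the real system the query vector comes from the same embedding model as the chunks, and the database performs this ranking over 384-dimensional vectors without scanning every record.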
Original Description
Implementing a RAG system in python, using Milvus vector database, and OpenAI GPT.
#RAG #LLM #gpt
Link to code: https://github.com/PyML-studio/mlstudio/blob/main/Notebooks/build_RAG/rag_from_scratch.ipynb
VisRAG paper: https://arxiv.org/abs/2410.10594