Retrieval-Augmented Generation (RAG)
Key Takeaways
The video explains the Retrieval-Augmented Generation (RAG) model, which combines Dense Passage Retrieval with a Seq2Seq BART generator and is tested on knowledge-intensive tasks such as open-domain QA, Jeopardy question generation, and FEVER fact verification. It relies on the Hugging Face Transformers library, FAISS, and a Wikipedia corpus.
Full Transcript
This video will explain the Retrieval-Augmented Generation (RAG) model, developed by researchers at Facebook and recently open-sourced in the Hugging Face Transformers library. The idea of this model is to augment language models with context. So instead of just using the input sequence x to generate output text y, you would also prepend retrieved documents z to the input sequence x, so the generated text y is a product of the input x as well as the retrieved documents z. The generated text y can be adapted to any downstream task, like classification or semantic similarity, as in the text-input, text-output task setup for natural language processing. These prepended documents help language models dramatically with respect to generating factually correct text and performing knowledge-intensive tasks such as fact verification or open-domain question answering.
The way the documents are retrieved is very interesting. The authors use Siamese BERT encoders of 100-word snippets from a Wikipedia corpus, as well as the input sequence x treated as a query. This is new compared to traditional information retrieval that relies on heuristics like TF-IDF or BM25: sparse, heuristically crafted feature vectors. It's also interesting to see the modularity of this algorithm. The authors use a pre-trained document index and query encoder integrated with a pre-trained BART generator, and you could also imagine swapping out the non-parametric external memory source, in this case Wikipedia, with something like a knowledge graph, or maybe a hybrid structured and unstructured external knowledge source.
This video will explain the components of this retrieval-augmented generation algorithm, a bit about the datasets it's tested on, and some interesting characteristics of these models compared to closed-book language models.
This video will explain the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Language models refers to these deep neural networks that are trained to
perform this task of predicting a masked-out token. This can be done in an autoregressive way, where you use all the context on the left to predict a masked-out token on the right, or with just a mask token all the way at the end of the sequence to generate text, when you're not even predicting some ground-truth labeled sequence, and you slide that mask along to produce a new generated sequence. You can also have these masks in the middle of the sequence, like in the BERT model, and then use the left and right context to predict these intermediate tokens.
So the current generation of language models take in this input x and use it to generate some text. But the idea behind retrieval-augmented generation, and also papers like REALM or ORQA, is that they're looking to add context to the language models to improve their performance on knowledge-intensive or factual kinds of tasks. The idea would be, instead of just having the sequence x, you would also do information retrieval to fetch some documents z from some kind of database, in this case a Wikipedia index, and then you append these documents to the input to facilitate generating text. So it's similar to this idea in GPT-3 where you have in-context learning, except in the case of in-context learning you're talking about appending demonstrations of the task to the input sequence; here you're appending documents that provide information for the generation.
So with this current generation of language models that do not have context: for one, you can't easily expand or revise their memory. If you want to change a fact, say event X happened in 1982 rather than 1976, just training on this one example isn't going to permanently change this implicit knowledge stored in the parameters of the neural network without access to any kind of retrieval or context. Additionally, you can't easily provide insight into their predictions. It's really hard to decode what causes it to generate some text and
it might hallucinate and generate false factual knowledge as it does this masked generation.
Here's a high-level overview of how the RAG algorithm works, when information retrieval meets language modeling. We start off with the non-parametric external memory. Parametric memory would refer to knowledge that's implicitly stored in the weights of a neural network; this non-parametric knowledge refers to 100-word samples from a Wikipedia corpus. So we have this big set, 21 million of these 100-word snippets from this Wikipedia corpus, and we're going to encode each of these 100-word sequences with a document encoder. Then when we ask a query, like we have this new x sequence with a mask at the end of it, we're going to treat that like a query, encode that query, and then use maximum inner product search, implemented with the FAISS library, to find the most similar documents that have been encoded in our non-parametric memory.
The way that the query is encoded and the document is encoded is through the use of the Sentence-BERT, Siamese BERT, two-tower kind of architecture. The idea is that the BERT model might do semantic similarity or natural language inference, a comparison of two different sequences, by taking them both in as input and then having this cross-attention over both of the sequences. Compare that to these Siamese architectures, which take in just a single sequence and produce a representation from that single sequence, which is then used to do comparison, either with cosine similarity or by passing them through a softmax and applying some kind of loss function that way. So the generator is going to take in these retrieved documents and append them to the context x to produce outputs y.
Firstly, we'll describe how dense passage retrieval integrates neural information retrieval to fetch the context for the BART generation model. As mentioned previously, we have this Siamese-network, two-tower architecture to compute the representation
of the documents, where we have a hundred words in each of these documents, and there are 21 million of these 100-word sequences extracted from Wikipedia. Each of these documents goes through a separate encoder, d(z). The document encoder and the query encoder are two separate 110-million-parameter BERT-base models that do not share parameters. The query encoder is going to be used to encode these queries, so anytime we have an x, say it's just a sequence with a mask at the end of it, we're going to encode that as a query and then use it to go find the most similar documents in the non-parametric memory of encoded Wikipedia sequences.
A core idea here is that we're not going to be training the document encoder. This is in comparison with a paper titled REALM, where they do rebuild this document index and update the encoding of these sequences from Wikipedia. But the idea here that's interesting is that when you encode them just one time, you can build this document index that facilitates vector similarity search. So anytime you encode one of these queries and turn it into a dense vector, and you want to look up the most similar vectors in this database of vector encodings, you can compute these centroids in the document index so that you don't have to do the comparison with all the vectors in the Wikipedia corpus, and you can speed up the search dramatically. It's implemented with the FAISS library, which does this maximum inner product search, and it's accelerated in other ways that go into more detail than what's described here.
Our description of dense passage retrieval covers how the z documents are fetched from the Wikipedia corpus, or these hundred-word slices from the Wikipedia corpus, and now we'll look at how the BART model is going to generate tokens y sub i given an input sequence x and previously generated tokens y sub 1 to i minus 1.
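The retrieval step described above can be illustrated with a minimal sketch. This is not the FAISS implementation or the authors' code: it is a brute-force maximum inner product search over a handful of toy vectors, followed by a softmax over the top-k scores to get a prior on each retrieved document. The function names (`retrieve_top_k`, `retrieval_prior`) and the two-dimensional vectors are hypothetical stand-ins for the outputs of the document encoder d(z) and the query encoder, and the softmax prior assumes the retrieval probability is proportional to the exponentiated inner product.

```python
import math


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(query_vec, doc_vecs, k):
    """Brute-force maximum inner product search (hypothetical helper).

    Returns the k highest-scoring (score, doc_index) pairs. A real index
    would avoid comparing the query against every document vector.
    """
    scored = [(inner_product(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def retrieval_prior(top_k):
    """Softmax over the top-k scores: an assumed prior p(z | x) per document."""
    max_score = max(score for score, _ in top_k)
    exps = [math.exp(score - max_score) for score, _ in top_k]
    total = sum(exps)
    return {idx: e / total for (_, idx), e in zip(top_k, exps)}


# Toy stand-ins for encoded 100-word passages (assumption: 2-d vectors).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
# Toy stand-in for the encoded query x.
query_vec = [1.0, 0.5]

top_k = retrieve_top_k(query_vec, doc_vecs, k=2)
prior = retrieval_prior(top_k)
print(top_k)
print(prior)
```

Because the document vectors are computed once and never updated, the expensive part (encoding and indexing the passages) can be done offline, and only the query vector changes at inference time.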
So there are two different ways they propose to decode from this set of latent documents, and they call this marginalizing over the documents, referring to summing over the different documents that are retrieved. When you compute this similarity between the documents and the query, there might be, say, five to ten documents that are really similar to the query encoding, so you're going to feed in each of these different z sub i's, the top-k latent documents z, the 100-word sequences most similar to the query encoding.
In the RAG-Token way of decoding, you take the product over each token. At each step of generating a token y sub i, you marginalize over the different latent documents z sub i, weighting each one by p sub eta, where eta denotes the parameters of the retriever. This is the matching between the document and the query, which gives a prior probability on the likelihood of this z sub i document being in the top k to begin with. The RAG-Sequence model is going to take just one latent document and generate an entire y 1 to n, or however long the sequence is going to be, and then you do that for each of the z's and sum together the probabilities of those entire sequences, each weighted by that same retrieval prior. They later describe more about how beam search is used to actually decode from these models, so there is a little more behind the details of how they decode from RAG-Sequence and RAG-Token.
But from a high-level idea, the BART model is a sequence-to-sequence model. It's going to encode the entire sequence, so in this case it could be the z latent documents, the input x, and then y 1 up to i minus 1. It's going to encode that entire concatenated sequence, and then it's going to start decoding and reconstructing the sequence, and at the end of it, it'll put the next y sub i token.
So the idea behind beam search: this is an illustration of a greedy search, just showing the path that's
traversed along the generation, from this blog post linked in the description of this video. You see at each time step it might take in "The" as y sub i minus 1, and then it can predict as y sub i either "dog", "nice", or "car", and it puts a probability on each of these generations. In this case it put 50 percent probability on "nice", 40 percent on "dog", and 10 percent on "car". In the case of RAG-Token and RAG-Sequence, we're weighting each of these probabilities, which come from theta, the parameters of the BART model, by the prior probability on the document that was retrieved, as determined by the similarity. Because we have, say, five to ten of these latent documents, we're weighting the probability of the tokens generated by the probability of retrieving that latent document in the first place, as we appended it to the context. So that's how you go about decoding this, and there are more details about exactly how they do this in the paper. Hopefully that was a decent overview of how the neural information retrieval model fetches latent documents z to be input along with x into this BART sequence-to-sequence model to generate a new sequence.
Another really interesting detail about the implementation of this paper is the way that they take these off-the-shelf pre-trained models, integrate them into this framework, and then train it further. You start off with a pre-trained BART generator; this BART model that's doing the encoding and decoding has been pre-trained on language modeling, and it's one of these open-source models available on something like Hugging Face. They also have the pre-trained dense passage retrieval, where the document and query encoders are trained by fetching documents that contain answers to questions in Natural Questions and TriviaQA. And they also have the non-parametric external memory, the Wikipedia corpus. So you could imagine taking out any one of these individual components and
replacing it with something else to do the generation, the document encoding and the query encoding, or the non-parametric external memory source. You can imagine maybe swapping this out with something like a knowledge graph or some other kind of memory source.
These are some of the datasets that the authors test the retrieval-augmented generation model on. These datasets are designed for open-domain question answering. Generally, open domain means that you expect these tasks to have to fetch some information in order to answer them, compared to, say, a closed-book question answering task. Closed book means that the neural network should be able to store all the knowledge it needs to perform the task in its own parameters; it shouldn't need to rely on some kind of external memory source. But these tasks are designed to be knowledge-intensive, and these authors have another paper where they benchmark a big list of different kinds of datasets that fall under this category of being knowledge-intensive, things like the FEVER fact verification dataset, or being able to generate a Jeopardy question given only the answer.
So this Natural Questions dataset is open-domain question answering, where you have questions about, say, John Wilkes Booth, airplane mode, or the royal sign manual, and you have to generate the answer. A lot of previous approaches would treat this as a classification problem. In the SQuAD extractive question answering task, you might take in this fetched passage as input and the model would classify where the answer is, so the output would be, say, indexes 10 through 15 is where the answer is. Compare that to these text-input, text-output models, where it's going to generate the answer; it's not just going to classify the position in the input text. So it's going to retrieve this context and generate the text of the answer, rather than classifying where it is in the context
span. It's a similar idea in the TriviaQA dataset, where you have these questions and you have to generate these answers, and they're very knowledge-intensive and require factual information to produce these kinds of answers.
They also look at the MS MARCO dataset, and this is a really interesting dataset where it has queries that were issued to Bing, so the queries are written in the kind of natural language that people type into a search engine. Then you have the top 10 passages, and the best answer is human-annotated. This is a description of the annotation interface that's used to take the top 10 passages that are returned from Bing for this query and then find the best answer, which is human-annotated. So this is another knowledge-intensive task: being able to answer any kind of search engine query with this Wikipedia corpus as non-parametric memory, the neural information retrieval system, and this BART text-output generator, compared to classifying the answer within one of these spans.
Then we also have the FEVER dataset, and these are 185,000 claims that are generated from Wikipedia. The way they annotate this is they sample some text from Wikipedia, and then the human annotator tries to devise some kind of claim, like "Barbara Bush was a spouse of the United States president during his term," and that would be an example of a claim that is supported by the document. Then another annotator would try to perturb this a little bit such that it's refuted by the evidence in the Wikipedia article. If you are interested further in these kinds of datasets, I recommend checking out this paper, "KILT: a Benchmark for Knowledge Intensive Language Tasks."
These are results comparing the RAG-Token and RAG-Sequence models, which are two different ways of incorporating the latent documents z into the BART generation, compared with the REALM approach that continually rebuilds the document index, and their previous paper, Dense
Passage Retrieval, which is extended in this paper, Retrieval-Augmented Generation. They're also comparing it with this closed-book T5 model with 11 billion parameters, as well as the variant of T5 that's trained with salient span masking. Salient span masking was found to be better for pre-training these models that eventually perform knowledge-intensive tasks like Natural Questions, TriviaQA, WebQuestions, or CuratedTREC. So this is a really interesting test, because RAG-Token and RAG-Sequence have this BART model with about 400 million parameters, compared to T5 with 11 billion parameters, and they're looking at completely different ways of accessing the knowledge in these neural networks. The RAG model is going to go fetch this context and append it to the sequence, whereas the T5 model is just going to store all of the knowledge in the parameters of the neural network, and it's using 11 billion parameters to do such a task. But we see a massive performance difference in favor of retrieval-augmented generation compared to the 11-billion-parameter T5 model.
The authors also test the retrieval-augmented generation model on the task of producing Jeopardy questions given the subject answer. Provided a subject like Hemingway, it generates this Jeopardy question such that the answer would be "What is Hemingway?", that kind of Jeopardy style. Here they're using human assessments to compare the difference between RAG and BART. RAG is where we're fetching this context and appending it to the subject Hemingway, so it'd be Hemingway, and then in front of this would be these fetched documents about Hemingway that facilitate generating this question. So it's showing this difference between BART and the RAG-Token decoding model, with the humans saying that the RAG-Token model is better than BART, or BART is better than RAG-Token, or both are good, or both are bad, or they can't tell, or something like that. And they're assessing this along how factual the
generated questions are, which is a huge deal in these knowledge-intensive tasks, where you don't want the natural language processing models that you're relying on for these tasks to produce text that isn't factually correct.
Earlier we described that RAG-Token and RAG-Sequence are two different ways of marginalizing across all the different retrieved documents z to produce the output y. In this case they're looking at how much each document, documents one through five, impacted a generated token. Say this is y sub i at some point, like "novel", and then this whole context to the left is y 1 to i minus 1. The darker blue squares indicate that document two had a massive impact on generating "Sun", and this other document had a massive impact here. These are two different, I think, book titles written by this author, and you're seeing the effect of having fetched these two different documents. You'd imagine this one document that's been fetched contains information about "A Farewell to Arms", and document two contains information about "The Sun Also Rises". So this is showing how the probability is put on these different latent documents z sub i to generate this y sequence.
So back to the topic of comparing BART with RAG. These are some different examples of inputs from the MS MARCO task or the Jeopardy question generation task, and the differences in generation between BART and RAG. In some cases, such as Washington, we see that BART says that this state has the largest number of counties in the U.S., which isn't actually true even though it looks true, whereas RAG has produced true statements because of the way it's retrieving this context and using it to generate these questions. So these are some examples comparing BART with retrieval-augmented generation.
This figure is showing some ablation results for different factors of the algorithm. The first is the number of
retrieved documents and how performance changes as you retrieve more documents. Say we fetch 50 of these 100-word sequences from our 21-million-passage Wikipedia corpus: will that improve the performance with respect to the Natural Questions exact match score? That's where you have this supervised label of the answer and you're checking for an exact match between the generated answer and the ground-truth answer. This shows that the model doesn't continue to improve in the case of the RAG-Token model as you retrieve more documents, even though the RAG-Sequence model seems to continue to improve, although it largely saturates after, say, 20 retrieved documents.
The middle is showing the impact of fine-tuning the query encoder. Earlier we mentioned that they don't continue to update the document encoder with respect to how you encode each of these 21 million 100-word sequences, but you do continue to update the query encoder that's used to go and find the most similar documents in that document index. So this is showing a huge difference between having a fixed dense passage retrieval model compared to fine-tuning the query encoder, and a particularly huge difference compared to BM25, which uses sparse features similar to TF-IDF to describe these documents and does that kind of information retrieval without neural models at all. Remember that the DPR and the RAG query and document encoders are both these Siamese BERT architectures with 110 million parameters each.
At the very end of the appendix, the authors describe this problem of retrieval collapse, where regardless of what the input x is that the query encoder encodes, it will always retrieve the same documents from that document index d(z). It sounds somewhat similar to GANs that produce the same image despite a different z input vector, which you call mode collapse. They do note that the generator learns to ignore the retrieved information once this starts happening, and just generates
from the implicit knowledge that's stored in the parameters. Remember that this BART model is a 400-million-parameter pre-trained model, and it does already have a lot of implicit knowledge in the weights of the neural network. But it's not really clear from the appendix section, or the paper overall, how problematic this retrieval collapse is and how much it really impacted their experiments.
Here are some ideas that could make the retrieval-augmented generation model even more powerful, or just be interesting to explore. In this paper we're looking at this Wikipedia slice as the non-parametric memory source, but it might also be interesting to see how structured information in the form of knowledge bases, where we have these triplets of entity, relation, other entity, might be useful as the non-parametric memory that's used to augment the context of these retrieval-augmented generation models.
Another interesting idea is how we can learn better representations for the document index. The document index is pre-trained to fetch documents that contain the answer span to Natural Questions and TriviaQA questions. Maybe there are strategies such as contrastive learning or some kind of self-supervised learning task that could improve the representations that are learned by this document encoder, and thus facilitate this and maybe alleviate that retrieval collapse problem, although I'm not sure exactly how much of a problem that is.
Then another interesting thing is how longer input sequence lengths will impact this. This survey on efficient transformers looks at models like Reformer, Linformer, Performer, and Longformer; there are also models like Transformer-XL and Compressive Transformer that use memory in the form of a kind of recurrent memory structure. But how will attending over a longer sequence play with having a longer context sequence from fetched neural information retrieval? We've looked at how retrieving more documents
helped, but in that experiment of going from 10, 20, 30, 40 retrieved documents, you're still talking about marginalizing over them and summing over the probabilities, compared to some kind of cross-attention where the input is, say, 5,000 words or some massive input sequence like that, and you have cross-attention over the whole sequence rather than combining the probabilities from these smaller sequences. So it could be interesting to see how these efficient transformers fit into architectures that integrate neural information retrieval with generation.
For a deeper dive into this paper, I highly recommend watching this talk from Patrick Lewis about the paper, which will be linked in the description of this video.
Thanks for watching this overview of retrieval-augmented generation. Hopefully this video made the high-level algorithm clear: how they use this document encoding and the query encoding, how they integrate that with the pre-trained BART model to have this context-augmented generation of text, how they adapt this to these different tasks by casting them in this text-input, text-output kind of format, and how it can perform all these different knowledge-intensive tasks. I'm really excited about this kind of model, and it's been open-sourced in the Hugging Face library, so if you do try it out, please let me know in the comments how your experience goes with trying this model on your own problems. Thanks for watching, and please subscribe to Henry AI Labs for more deep learning and AI videos.
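As a companion to the RAG-Token vs. RAG-Sequence part of the transcript, here is a minimal sketch of the two ways of marginalizing over retrieved documents. It is an illustration under stated assumptions, not the authors' decoding code: the function names (`rag_sequence_prob`, `rag_token_prob`) are hypothetical, the retrieval priors and per-token generator probabilities are made-up toy numbers, and beam search is left out entirely. RAG-Sequence weights each document's whole-sequence probability by its prior and sums; RAG-Token mixes the documents at every token position and multiplies the per-token mixtures.

```python
def rag_sequence_prob(doc_priors, token_probs):
    """Sum over documents of prior(z) * product over tokens of p(y_i | x, z, y_<i)."""
    total = 0.0
    for prior, probs in zip(doc_priors, token_probs):
        seq_prob = 1.0
        for p in probs:
            seq_prob *= p
        total += prior * seq_prob
    return total


def rag_token_prob(doc_priors, token_probs):
    """Product over tokens of (sum over documents of prior(z) * p(y_i | x, z, y_<i))."""
    n_tokens = len(token_probs[0])
    result = 1.0
    for t in range(n_tokens):
        result *= sum(
            prior * probs[t] for prior, probs in zip(doc_priors, token_probs)
        )
    return result


# Toy numbers (assumptions): two retrieved documents, a two-token target.
# Document 0 supports the first token well, document 1 the second.
doc_priors = [0.6, 0.4]
token_probs = [
    [0.9, 0.1],  # generator probabilities of each target token given document 0
    [0.2, 0.8],  # generator probabilities of each target token given document 1
]

seq = rag_sequence_prob(doc_priors, token_probs)
tok = rag_token_prob(doc_priors, token_probs)
print(round(seq, 4), round(tok, 4))  # approximately 0.118 and 0.2356
```

In this toy case RAG-Token assigns the target a higher probability because each token can lean on a different document, which mirrors the Hemingway example where different retrieved passages drive different book titles.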
Original Description
This video explains the Retrieval-Augmented Generation (RAG) model! This approach combines Dense Passage Retrieval with a Seq2Seq BART generator. This is tested out on knowledge intensive tasks like open-domain QA, jeopardy question generation, and FEVER fact verification. This looks like a really interesting paradigm for building language models that produce factually accurate generations!
Thanks for watching! Please Subscribe!
Paper Links:
Original Paper: https://arxiv.org/pdf/2005.11401.pdf
FB Blog Post (Animation used in Intro): https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models
HuggingFace RAG description: https://huggingface.co/transformers/model_doc/rag.html
Billion-scale similarity search with GPUs: https://arxiv.org/pdf/1702.08734.pdf
Language Models as Knowledge Bases? https://arxiv.org/abs/1909.01066
REALM: Retrieval-Augmented Language Models: https://arxiv.org/pdf/2002.08909.pdf
Dense Passage Retrieval: https://arxiv.org/pdf/2004.04906.pdf
FEVER: https://arxiv.org/pdf/1803.05355.pdf
Natural Questions: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/1f7b46b5378d757553d3e92ead36bda2e4254244.pdf
TriviaQA: https://arxiv.org/pdf/1705.03551.pdf
MS MARCO: https://arxiv.org/pdf/1611.09268.pdf
Thanks for watching!
Time Stamps
0:00 Introduction
2:05 Limitations of Language Models
4:10 Algorithm Walkthrough
5:48 Dense Passage Retrieval
7:44 RAG-Token vs. RAG-Sequence
10:47 Off-the-Shelf Models
11:54 Experiment Datasets
15:03 Results vs. T5
16:16 BART vs. RAG - Jeopardy Questions
17:20 Impact of Retrieved Documents zi
18:53 Ablation Study
20:25 Retrieval Collapse
21:10 Knowledge Graphs as Non-Parametric Memory
21:45 Can we learn better representations for the Document Index?
22:12 How will Efficient Transformers impact this?