The Illustrated Retrieval Transformer

Jay Alammar · Beginner ·🧠 Large Language Models ·3y ago

Key Takeaways

This video explains the Retrieval Transformer, a language model that achieves GPT-3 like performance by querying a database or searching the web

Full Transcript

language models are able to generate text better than any software system we've had before and the larger the model the more the training data it was trained on the better the text that it generates there's a challenge here however in that these large models need super computers being able to run them and deploy them vast amount of resources go into that a lot of know-how is required to get that working so the question is how can we make these models smaller one of the leading ideas to empower smaller models to have great text generation capabilities is retrieval this is the idea that the model does not need to store the world's information into its parameters and when deployed we don't need to load all of that information into GPU memory retrieval language models boost the model with a database or the ability to retrieve information from a search engine like Google and the retrieved information from that database or from the internet helps it generate text in this video we'll be going over my article explaining deepminds retro Transformer which at four percent the size is comparable to gpt3 at language intensive tasks this is a very exciting direction that is explored in retro Transformers but also in other models like memorizing Transformers like web GPT and even Google's Lambda honestly this is an idea so powerful that it almost seems wasteful for future language models to be created without this this ability it also has nice properties like the ability to curate and filter the data that the model has access to and potentially also the ability to update the model's data without the need to retrain the model from scratch with that let's look at retrieval Transformers so these cool developments in retrieval and the ability to empower Transformer models with a retrieval or a search component is very exciting and it's we saw a couple of these papers towards the end of 2021 this being one of them the Retro Transformer being one of them web GPT being another one and then memorizing Transformers was released after that and they were all around this very cool idea that you do not need a model with 175 billion parameters it's not 85. it's you don't need to store all of the world's information in the parameters of the model and reload them into GPU memory you can have a Model A language model four percent of the size but then help it with a retrieval component and that injects information from other sources from the Internet or from a database in this case into this model and so um here's an example of let's say we have this prompt that we give the model and the model needs to complete it so if we say the Dune film was released in this is something that the model has to have this factual information just knowing knowledge is not enough for this but here's another example of a of a prompt it's popularity spread by Word of Mouth to allow Herbert to start working full you don't need factual information to fill this in you need maybe just you have to understand the probabilities of words and from here I there are these two components potentially of of language models that are both represented in these large GPT models so language information and then World Knowledge information and I love that this new approach of retrieval your language model needs to maybe have a focus a little bit on language information and the world knowledge information can be outsourced and it's better for all of that information to not be stored within the model so you don't need to train the model every time you upload it so we can use Advanced and high performance databases for example to retrieve very quickly instead of inefficiently storing all of that into this massive model so this is I think a visual that captures a little bit of why this is exciting and why after this comes out it seems that this is really the way to go um and it sort of helps you audit the sources of this information and you can have a little bit more more control over where this information was sorted from and what information sources does a model have access to I'm really excited about this area here we talk a little bit about how this one model does it through his Retro by by Deep Mind so it has a built-in database of keys and values and then when it gets an input prompt it gets the tokens and makes it creates a sentence embedding of that and then it uses that as a query to see where has the models database seen words or sentences that are similar to this so this is the retrieval part it uses the prompt the input prompt as a as a query into the database and then from there it gets two things it gets this similar sentence that is similar to the input but also the sentence after it so to speak because that's maybe a lot of the times where that information would be would would and so these are all fed into a retro so it gets the input prompt but it also gets this relevant information and that this system has deemed let's say relevant to this so it provides a number of different neighbors I think it's to a training time but then at you can do more like you're not you're not limited to that a little bit more uh details about the actual architecture or structure so you have a encoder stack and a decoder stack and the encoder is mainly to encode the the neighbors into let's say keys and values and then when you have the prompt that goes into the decoder and specific layers in the decoder specific decoder blocks this retro decoder block has access to this encoder through the keys and values so these layers layer number let's say nine here can look back at the relevant information and brings the relevant information from the database memorizing Transformer does it a little bit you know in a different way maybe a way that's maybe simpler than the one specifically used here but this is a very interesting direction that I super excited to see us taking because it allows us to use much smaller language models and have other better retrieval tools to augment its information

Original Description

The latest batch of language models can be much smaller yet achieve GPT-3 like performance by being able to query a database or search the web for information. A key indication is that building larger and larger models is not the only way to improve performance. This video provides a gentle intro RETRO, DeepMind's retrieval-augmented Transformer. There are more details in the associated blog post: https://jalammar.github.io/illustrated-retrieval-transformer/ Corrections: 1:15 - *knowledge-intensive tasks --- Contents: Introduction (0:00) Retrieval-enhanced language models (0:40) The Illustrated Retrieval Transformer (1:55) World-knowledge vs. language knowledge (2:57) The database and how to query it and retrieve information (4:45) Feeding the prompt and retrieved information into the model (5:39) The architecture of the model (6:00) --- Paper: https://www.deepmind.com/publications/improving-language-models-by-retrieving-from-trillions-of-tokens Authors: Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre ---- Twitter: https://twitter.com/JayAlammar Blog: https://jalammar.github.io/ Mailing List: https://jayalammar.substack.com/ --- More videos by Jay: Experience Grounds Language: Improving language models beyond the world of text https://www.youtube.com/watch?v=WQm7-X4gts4 Language Processing with BERT: The 3 Minute Intro (Deep learning for NLP) https://youtu.be/ioGry-89gqE Explainable AI Cheat Sheet - Five Key Categories https://www.youtube.com/watch?v=Yg3q5x7yDeM The Narrated Transformer Language Model https://youtu.be/-QH8fRhqFHM
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Jay Alammar · Jay Alammar · 22 of 38

1 Jay's Visual Intro to AI
Jay's Visual Intro to AI
Jay Alammar
2 Making Money from AI by Predicting Sales - Jay's Intro to AI Part 2
Making Money from AI by Predicting Sales - Jay's Intro to AI Part 2
Jay Alammar
3 How GPT3 Works - Easily Explained with Animations
How GPT3 Works - Easily Explained with Animations
Jay Alammar
4 The Narrated Transformer Language Model
The Narrated Transformer Language Model
Jay Alammar
5 My Visualization Tools (my Apple Keynote setup for visualizations and animations)
My Visualization Tools (my Apple Keynote setup for visualizations and animations)
Jay Alammar
6 Explainable AI Cheat Sheet - Five Key Categories
Explainable AI Cheat Sheet - Five Key Categories
Jay Alammar
7 The Unreasonable Effectiveness of RNNs (Article and Visualization Commentary) [2015 article]
The Unreasonable Effectiveness of RNNs (Article and Visualization Commentary) [2015 article]
Jay Alammar
8 Neural Activations & Dataset Examples
Neural Activations & Dataset Examples
Jay Alammar
9 Up and Down the Ladder of Abstraction [interactive article by Bret Victor, 2011]
Up and Down the Ladder of Abstraction [interactive article by Bret Victor, 2011]
Jay Alammar
10 Probing Classifiers: A Gentle Intro (Explainable AI for Deep Learning)
Probing Classifiers: A Gentle Intro (Explainable AI for Deep Learning)
Jay Alammar
11 Inspecting Neural Networks with CCA - A Gentle Intro (Explainable AI for Deep Learning)
Inspecting Neural Networks with CCA - A Gentle Intro (Explainable AI for Deep Learning)
Jay Alammar
12 Language Processing with BERT: The 3 Minute Intro (Deep learning for NLP)
Language Processing with BERT: The 3 Minute Intro (Deep learning for NLP)
Jay Alammar
13 Behavioral Testing of ML Models (Unit tests for machine learning)
Behavioral Testing of ML Models (Unit tests for machine learning)
Jay Alammar
14 Favorite AI/ML Books: Intro to ML with Python (Book Review)
Favorite AI/ML Books: Intro to ML with Python (Book Review)
Jay Alammar
15 Favorite Python Books: Effective Python
Favorite Python Books: Effective Python
Jay Alammar
16 Favorite Stats Books: Seven Pillars of Statistical Wisdom
Favorite Stats Books: Seven Pillars of Statistical Wisdom
Jay Alammar
17 Understanding Animal Languages - Seeing Voices 2
Understanding Animal Languages - Seeing Voices 2
Jay Alammar
18 How digital assistants like Siri work #shorts
How digital assistants like Siri work #shorts
Jay Alammar
19 Writing Code in Jupyter Notebooks #shorts
Writing Code in Jupyter Notebooks #shorts
Jay Alammar
20 Experience Grounds Language: Improving language models beyond the world of text
Experience Grounds Language: Improving language models beyond the world of text
Jay Alammar
21 pandas for data science in python #shorts
pandas for data science in python #shorts
Jay Alammar
The Illustrated Retrieval Transformer
The Illustrated Retrieval Transformer
Jay Alammar
23 AI Image Generation is MIND BLOWING! #shorts
AI Image Generation is MIND BLOWING! #shorts
Jay Alammar
24 A Generalist Agent (Gato) - DeepMind's single model learns 600 tasks
A Generalist Agent (Gato) - DeepMind's single model learns 600 tasks
Jay Alammar
25 The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning
The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning
Jay Alammar
26 AI Art Explained: How AI Generates Images (Stable Diffusion, Midjourney, and DALLE)
AI Art Explained: How AI Generates Images (Stable Diffusion, Midjourney, and DALLE)
Jay Alammar
27 What is Generative AI? 4 Important Things to Know (about ChatGPT, MidJourney, Cohere & future AIs)
What is Generative AI? 4 Important Things to Know (about ChatGPT, MidJourney, Cohere & future AIs)
Jay Alammar
28 AI is Eating The World - This is Where YOU Can Use it to Compete (AI Product Moats)
AI is Eating The World - This is Where YOU Can Use it to Compete (AI Product Moats)
Jay Alammar
29 What is LangChain? Where does it fit with LLMs like ChatGPT and Cohere? #shorts
What is LangChain? Where does it fit with LLMs like ChatGPT and Cohere? #shorts
Jay Alammar
30 Are language models with more parameters better? #shorts #chatgpt
Are language models with more parameters better? #shorts #chatgpt
Jay Alammar
31 How to manage LLM prompts with tools like LangChain #languagemodels #chatgpt
How to manage LLM prompts with tools like LangChain #languagemodels #chatgpt
Jay Alammar
32 What is Llama Index? how does it help in building LLM applications? #languagemodels #chatgpt
What is Llama Index? how does it help in building LLM applications? #languagemodels #chatgpt
Jay Alammar
33 prompt chains are important for building large language model applications
prompt chains are important for building large language model applications
Jay Alammar
34 ChatGPT has Never Seen a SINGLE Word (Despite Reading Most of The Internet). Meet LLM Tokenizers.
ChatGPT has Never Seen a SINGLE Word (Despite Reading Most of The Internet). Meet LLM Tokenizers.
Jay Alammar
35 What makes LLM tokenizers different from each other? GPT4 vs. FlanT5 Vs. Starcoder Vs. BERT and more
What makes LLM tokenizers different from each other? GPT4 vs. FlanT5 Vs. Starcoder Vs. BERT and more
Jay Alammar
36 Building LLM Agents with Tool Use
Building LLM Agents with Tool Use
Jay Alammar
37 SWE-Bench authors reflect on the state of LLM agents at Neurips 2024
SWE-Bench authors reflect on the state of LLM agents at Neurips 2024
Jay Alammar
38 Learn how ChatGPT and DeepSeek models work: How Transformer LLMs Work [Free Course]
Learn how ChatGPT and DeepSeek models work: How Transformer LLMs Work [Free Course]
Jay Alammar

Related AI Lessons

Building LSTMs with PyTorch and Lightning AI Part 7: Resuming Training with Checkpoints
Learn to resume LSTM training with checkpoints using PyTorch and Lightning AI, enabling efficient model iteration and development
Dev.to · Rijul Rajesh
How AI Learns with Less Labeled Data
Learn how AI can learn with less labeled data, a crucial aspect of machine learning beyond model selection
Medium · AI
Comparing Sarvam-30B and Qwen2.5–14B on Spider Text-to-SQL: An Active-Parameter Perspective
Learn how to compare large language models like Sarvam-30B and Qwen2.5-14B on the Spider Text-to-SQL benchmark from an active-parameter perspective
Medium · LLM
Claude Sonnet 5 closes the gap to Opus without the Opus bill
Claude Sonnet 5 emerges as a cost-effective alternative to Opus, learn how it closes the gap without the hefty price tag
Medium · LLM
Up next
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)
Watch →