Retrieval Augmented Generation in the Wild: Anton Troynikov

AI Engineer · Intermediate ·🔍 RAG & Vector Search ·2y ago

Key Takeaways

The video discusses Retrieval Augmented Generation (RAG) in AI applications, covering open challenges and techniques, with Chroma and Nvidia's Voyager paper as examples and a focus on vector stores, human feedback, and agent self-updates.

Full Transcript

[Music]

Hi everybody. As they said as I walked up, I'm Anton, I'm the co-founder of Chroma. I'm here to talk to you about retrieval augmented generation in the wild, and what it is that Chroma is building for beyond just vector search.

By now you've all seen versions of this probably a half dozen times throughout this conference. This is the basic retrieval loop that one would use in a RAG application. You have some corpus of documents; you embed them in your favorite vector store, which is Chroma (I mean, check the lanyards, man). You embed your corpus of documents, you have an embedding model for your queries, you find the nearest-neighbor vectors for those embeddings, and you return the associated documents, which, along with the query, you then put into the LLM context window and return some result.

Now this is the basic RAG loop, but I think of this as more like the open-loop retrieval augmented generation application. My purpose in showing you all this is to show you that you need a lot more than simple vector search to build some of the more powerful, more promising applications that take RAG into the future. So let's get into what some of those might be.

The first piece of this, of course, is incorporating human feedback into this loop. Without human feedback, it isn't possible to adapt the data, or the embedding model itself, to the specific task, to the model, and to the user. Human feedback is required to actually return better results for particular queries on your specific data, on the specific task that you want to perform. Generally, embedding models are trained in a general context, and you actually want to update them for your specific task. So basically, the memory that you're using for your RAG application needs to be able to support this sort of human feedback.

Now the other piece that we've seen is agents. These are currently in the early stages, but they're emerging as something like a capable machine, and I think that one of the
ways to make agents actually capable is a better RAG system, a better memory for AI. That means that your retrieval system, your memory, needs to support self-updates from the agent itself out of the box. All in all, what this means is you have a constantly, dynamically updating data set. Something that's built as a search index out of the box is not going to be able to support these types of capabilities.

Next, of course, we're talking about agents with world models. In other words, the agent needs to be able to store its interaction with the world and update the data that it's working with based on that interaction. And finally, you need to be able to tie all of these together.

Now this sounds like a very complex system that's frontier research, and it is currently research grade, but we're seeing some of the first applications of this in the wild already today. I'm sure some of you are familiar with this paper: this is an animation from the famous Voyager paper out of Nvidia, where they trained an agent to play Minecraft, to learn how to play it by learning skills in a particular environment and then recognizing when it's in the same context and recalling that skill. The other interesting piece to this is that several of the more complex skills were learned through human demonstration and then retained in the retrieval system, which of course was Chroma.

My point in showing this to you is that the simple RAG loop might be the bread and butter of most of the applications being developed today, but the most powerful things that you'll be able to build with AI in the future require a much more capable retrieval system than one that only supports a search index.

Now, of course, in retrieval itself there are plenty of challenges. Information retrieval is kind of a classic task, and the setting in which it's been found previously has been in recommender systems and in search systems. Now that we're all using this in production for AI applications in
completely different ways, there are a lot of open questions that haven't really been asked quite in the same way or with quite the same intensity.

A key piece of how retrieval needs to function for AI, and anyone who's built one of these is aware of this, is that you need to be able to return not just all relevant information but also no irrelevant information. It's common knowledge by now, and this is supported by empirical research, that distractors in the model context cause the performance of the entire AI-based application to fall off a cliff.

So what does it mean to actually retrieve relevant info and no irrelevant info? You need to know which embedding model you should be using in the first place. We've all seen the claims from the different API and embedding model providers: this one is best for code, this one is best for English language, this one is best for multilingual data sets. But the reality is that the only way to find out which is best for your data set is to have an effective way to figure that out.

The next question, of course, is how do I chunk up the data? Chunking determines what results are available to the model at all, and it's obvious that different types of chunking produce different relevancy in the returned results. And finally, how do we even determine whether a given retrieved result is actually relevant to the task or to the user? So let's dive into some of these in a little bit more depth.

The bad news is, again, nobody really has the answers. Despite the fact that information retrieval is a long-studied problem, there isn't a great solution to these problems today. But the good news is that these are important problems, and increasingly important problems, and we see much more production data, rather than sort of academic benchmarks, that we can work from to solve some of these for the first time.

So first, the question of which embedding model we should be using. Of course there are existing academic
benchmarks, and for now these appear to be mostly saturated. The reason for that is that these are synthetic benchmarks designed specifically for the information retrieval problem, and they don't necessarily reflect how retrieval systems are used in AI use cases.

So what can you do about that? You can take some of the open-source tooling built to build these benchmarks in the first place and apply it to your data sets and your use cases. You can use human feedback on relevance by adding a simple relevance feedback endpoint, and this is something that Chroma is building to support in the very near future. You can construct your own data sets, because you're viewing your data in production and you know what actually matters to you. And then you need a way to effectively evaluate the performance of particular embedding models. Of course, there are great evaluation tools coming onto the market now from several vendors. Which of these is best we don't know, but we intend to support all of these with Chroma.

One interesting part about embedding models, and this is something that's been well known in the research community for a while but has been empirically tested recently: embedding models with the same training objective and roughly the same data tend to learn very similar representations, up to an affine linear transform. That suggests that it's possible to project one model's embedding space into another model's embedding space by using a simple linear transform. So the choice of which embedding model you actually want to use might not end up being so important, if you're able to figure out those transforms from your own data set.

So the question is how to chunk. Of course there are a few things to consider. Chunking in part exists because we have bounded context lengths for our LLMs, so we want to make sure that the retrieved results can actually fit in that context. We want to make sure that we
retain the semantic content of the data we're aiming to retrieve. Then we want to make sure that we retain the relevant semantic content of that data, rather than just semantic content in general. We also want to make sure that we're respecting the natural structure of the data, because often, especially with textual data, it was generated for humans to read and understand in the first place, so the inherent structure of that data provides cues about where the semantic boundaries might be.

Of course there are tools for chunking. There's NLTK, there's LangChain, and LlamaIndex also supports many forms of chunking. But there are experimental ideas here which we're particularly interested in trying. One interesting thought that we've had, and we're experimenting with lightweight open-source language models to achieve this, is using the model's prediction perplexity for the next actual token in the document, based on a sliding window of previous tokens. In other words, you can use the points where the model mispredicts, or has a very low probability for the next actual piece of text, as an indicator of where a semantic boundary in the text might be, and that might be a natural place for chunking. What that also means is that, because you have a model actually predicting chunk boundaries, you can then fine-tune that model to make sure the chunk boundaries are relevant to your application. So this is something that we're actively exploring.

We can use information hierarchies; again, tools like LlamaIndex support information hierarchies out of the box, as well as multiple data sources and signals for ranking. And we can also try to use embedding continuity. This is something that we're experimenting with as well, where essentially you take a sliding window across your documents, embed that sliding window, and look for discontinuities in the resulting time series.

So this is an important question, and I'll give you a demonstration about why
being able to compute retrieval result relevance is actually very important in your application.

Imagine in your application you've gone and embedded every English-language Wikipedia page about birds, and that's what's in your corpus. In your traditional retrieval augmented generation system, what you're doing for each query is just returning the five nearest neighbors and then stuffing them into the model's context window. Now one day a user's query comes along, and that query is about fish and not birds. You're guaranteed to return some five nearest neighbors, but you're also guaranteed to not have a single relevant result among them. How can you, as an application developer, make that determination?

So there are a few possibilities here. The first, of course, is human feedback around the relevancy signal. The traditional approach in information retrieval is using an auxiliary reranking model. In other words, you take other signals in the query chain (what else was the user looking at at the time, what things has the user found to be useful in the past) and use those as additional signal around the relevancy. And we can also, of course, do augmented retrieval, which Chroma does out of the box: we have keyword-based search and we have metadata-based filtering, so you can scope the search if you have those additional signals beforehand.

Now to me the most interesting approach here is actually an algorithmic one. What I mean by that is: conditional on the data set that you have available, and conditional on what we know about the task that the user is trying to perform, it should be possible to generate a conditional relevancy signal per user, per task, per model, and per instance of that task. But this requires a model which can understand the semantics of the query as well as the content of the data set very well. This is something that we're experimenting with, and this is another place where we think open-source lightweight language models have
actually a lot to offer, even at the data layer.

So, to talk a little bit about what we're building; this is the advertising portion of my talk. In core engineering, we're of course building a horizontally scalable cluster version. Single-node Chroma works great, and many of you have probably already tried it by now. It's time to actually make it work across multiple nodes. By December we'll have our database-as-a-service technical preview up and ready, so you guys can try Chroma Cloud. In January we'll have our hybrid deployments available, if you want to run Chroma in your enterprise cluster.

Along the way we're building to support multimodal data. We know that the GPT vision API is coming very soon, probably at OpenAI's developer day, and Gemini will also have image understanding and voice. That means that you'll be able to use multimodal data in your retrieval applications for the first time, so we're no longer just talking about text. These questions about relevancy and other types of data become even more important, because now you start having questions about relevancy, aesthetic quality, all of these other pieces which you need to make these multimodal retrieval augmented systems work.

And finally, we're working on model selection. Basically, Chroma wants to do everything in the data layer for you, so that, just like with a modern DBMS, just like you use Postgres in a web application, everything in the data layer should just work for you as an application developer. Your focus should be on the application logic and making your application actually run correctly, and that's what Chroma is building for in AI. And that's it, thank you very much.

[Applause]
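The birds-and-fish example from the talk can be made concrete with a small sketch. This is only an illustration in plain Python, not Chroma's API: the toy three-dimensional vectors, the `retrieve` helper, and the `max_distance` cutoff of 0.5 are all assumptions chosen for the demonstration. A fixed distance cutoff is the simplest possible relevance signal and is not the conditional, model-based signal the speaker describes.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)

def retrieve(query_vec, corpus, k=5, max_distance=0.5):
    """Return up to k (doc_id, distance) pairs, dropping distant neighbours.

    corpus maps doc_id -> embedding vector. max_distance is a hypothetical
    cutoff; a plain k-nearest-neighbour search has no such filter.
    """
    scored = sorted(
        ((doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in corpus.items()),
        key=lambda pair: pair[1],
    )
    return [(doc_id, dist) for doc_id, dist in scored[:k] if dist <= max_distance]

# Toy "embeddings"; real ones would come from an embedding model.
bird_corpus = {
    "sparrow": [0.9, 0.1, 0.0],
    "eagle": [0.8, 0.2, 0.0],
    "penguin": [0.7, 0.1, 0.2],
}
bird_query = [0.85, 0.15, 0.0]
fish_query = [0.0, 0.1, 0.95]

bird_hits = retrieve(bird_query, bird_corpus)  # all three bird pages pass the cutoff
fish_hits = retrieve(fish_query, bird_corpus)  # nothing is close enough, so the list is empty
```

Without the cutoff, the fish query would still return its nearest bird pages, which is exactly the failure mode described above. In practice a single global threshold rarely transfers across embedding models or data sets, which is why the talk points toward learned or feedback-driven relevance signals instead.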

Original Description

In the last few months, we've seen an explosion of the use of retrieval in the context of AI. Document question answering, autonomous agents, and more use embeddings-based retrieval systems in a variety of ways. This talk will cover what we've learned building for these applications, the challenges developers face, and the future of retrieval in the context of AI.

Recorded live in San Francisco at the AI Engineer Summit 2023. See the full schedule of talks at https://ai.engineer/summit/schedule & join us at the AI Engineer World's Fair in 2024! Get your tickets today at https://ai.engineer/worlds-fair

About Anton Troynikov: Anton is the co-founder of Chroma. He does not believe AI will kill us all. Chroma builds an open-source embeddings store, specifically built for AI-native applications.

This video covers Retrieval Augmented Generation (RAG) and its applications in AI, including vector stores, human feedback, and agent self-updates, with a focus on practical implementation and evaluation. The speaker discusses Chroma and Nvidia's Voyager paper as examples, along with open problems in embedding model selection, chunking, and relevance.

Key Takeaways
  1. Build a vector store using Chroma
  2. Incorporate human feedback into the retrieval loop
  3. Use auxiliary reranking models for information retrieval
  4. Employ keyword-based search and metadata-based filtering
  5. Generate a conditional relevancy signal per user, per task, per model
  6. Build a horizontally scalable cluster version of Chroma
  7. Support multimodal data with the GPT vision API and Gemini
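As a rough illustration of takeaway 4, the sketch below scopes a similarity search with a metadata filter and a keyword filter before ranking. It is plain Python under stated assumptions: the `scoped_search` function, its `where` and `must_contain` parameters, the record layout, and the two-dimensional toy embeddings are all hypothetical and are not Chroma's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def scoped_search(query_vec, records, where=None, must_contain=None, k=3):
    """Rank records by similarity after applying metadata and keyword filters."""
    where = where or {}
    candidates = []
    for rec in records:
        # Metadata filter: every requested key must match exactly.
        if any(rec["metadata"].get(key) != value for key, value in where.items()):
            continue
        # Keyword filter: case-insensitive substring match on the document text.
        if must_contain and must_contain.lower() not in rec["text"].lower():
            continue
        candidates.append(rec)
    candidates.sort(
        key=lambda rec: cosine_similarity(query_vec, rec["embedding"]),
        reverse=True,
    )
    return [rec["id"] for rec in candidates[:k]]

records = [
    {"id": "a", "text": "Penguins are flightless birds.",
     "metadata": {"lang": "en"}, "embedding": [0.9, 0.1]},
    {"id": "b", "text": "Les manchots ne volent pas.",
     "metadata": {"lang": "fr"}, "embedding": [0.95, 0.05]},
    {"id": "c", "text": "Eagles are birds of prey.",
     "metadata": {"lang": "en"}, "embedding": [0.2, 0.8]},
]
query = [1.0, 0.0]

unscoped = scoped_search(query, records)
english_only = scoped_search(query, records, where={"lang": "en"})
english_penguin = scoped_search(query, records, where={"lang": "en"}, must_contain="penguin")
```

Filtering before ranking means the nearest neighbours are chosen only from records that already satisfy the extra signals, so a close but out-of-scope document (record `b` here) cannot crowd out in-scope results.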
💡 The speaker highlights the importance of considering chunking, semantic boundaries, and information hierarchies when implementing RAG systems, and discusses the challenges of determining which embedding model to use and how to chunk up the data.
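The embedding-continuity idea mentioned in the talk (slide a window across the document, embed each window, and look for discontinuities) might look roughly like the sketch below. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the window of two sentences and the 0.4 similarity threshold are arbitrary, and the talk does not specify how Chroma implements this.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(word.strip(".,").lower() for word in text.split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk_by_continuity(sentences, window=2, threshold=0.4):
    """Start a new chunk wherever the window before a sentence boundary
    and the window after it have similarity below the threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        before = toy_embed(" ".join(sentences[max(0, i - window):i]))
        after = toy_embed(" ".join(sentences[i:i + window]))
        if cosine_similarity(before, after) < threshold:
            chunks.append(current)  # discontinuity: close the current chunk
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks

sentences = [
    "Penguins are birds that cannot fly.",
    "Penguins are birds that swim well.",
    "Sharks are fish with sharp teeth.",
    "Sharks are fish that hunt seals.",
]
chunks = chunk_by_continuity(sentences)  # splits between the penguin and shark sentences
```

With a real embedding model the similarity series is noisier, so a fixed threshold would likely need tuning per data set, or replacing with a rule based on local minima.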
