Multi-modal Retrieval Augmented Generation with LlamaIndex

LlamaIndex · Intermediate ·🔍 RAG & Vector Search ·2y ago

Key Takeaways

This video demonstrates how to build production-ready Retrieval Augmented Generation (RAG) applications using LlamaIndex's multi-modal capabilities, including text, images, and audio. It covers the basics of RAG, LlamaIndex, and its components, as well as how to perform basic image querying, multi-modal retrieval, and image-to-image retrieval.

Full Transcript

hi everybody I'm lri VP of developer relations at llama index and today I'm going to be talking about retrieval augmented generation or rag specifically multimodal rag that is rag that includes not just text but also uh images or audio I'll briefly reintroduce you to Lama index and some of our features that get your apps to production then dive into the details of how to make a multimodal app using llama index so let's get started let's start with a quick recap of the magic of rag the core of retrieval augmented generation is the fact that retrieval works in the first place when llms learn information what they're really doing is converting it into numbers specifically vectors we call the total set of available vectors the vector space so when you convert your data into vectors we say that you're embedding it into the vector space which is why we call these things embeddings for short an amazing property of vector embeddings is that if you take a question and you convert that into a vector as well it will end up nearby in Vector space to the data that contains the answer this isn't keyword matching it's encoding the meaning of the question and you can use quite simple math to find text with similar meanings which is a pretty magical feature so that's how rag works first you embed all of your data using an llm model then you embed your query using the same llm model and you perform some relatively simple math that tells you which chunks are closest in meaning to your question those are the most significant pieces of context so you send those and your question to an llm and most of the time that's enough to get a right answer multimodality has recently exploded into the llm world this gives llms the ability to understand images and audio not just the text on which they were originally based but amazingly rag can work exactly the same way just as you can embed text you can embed images and audio once your text and images are vectors the same math Works to retrieve them you can even embed a text query to get back images or embed an image to get similar images we're going to show you how to do both of those things today but first let's do a quick refresher on llama index itself a retrieval augmented gen generation application can be said to have six stages first you have to load your data from wherever it sits second you have to index it cut it into chunks and feed them into an embedding model third you have to store all of that stuff in a vector store then when you're ready to ask a question step four is you give the query to the vector store and it retrieves the most relevant context you feed that context and your query as a prompt to an llm which synthesizes your answer we sometimes call just call stages four through six just querying and there is a seventh stage called evaluation which we'll be covering in a follow-up video to this one a multimodal rag application is much the same but it starts with two parallel paths because while you can embed images you can't use the exact same embedding model so we load text and images in parallel and we embed them differently if we are embedding text we would use something like Ada O2 if we're embedding images we'd use something like clip we store them separately but often side by side in the same database management system at the retrieval stage we fetch results from both stores and provide them as context to the llm to respond to the query managing all of these stages and all and all of the storing and indexing is a lot of work which brings us to llama index llama index is a framework that takes care of all of those stages I just mentioned in the six lines of code that I'm showing here on line two it loads everything from a local directory on line three it indexes it and then stores it in memory on line four the query engine is instantiated and on line five it takes care of retrieval the prompting and the synthesis and line six is the result of course this is a toy example in production you'd want to store stuff in a vector store not in memory and you'd want to get the data from somewhere else that brings us to llam HUB our registry of hundreds of connectors that can get data from anywhere from a database to your Google drive or your slack or notion on llama Hub we also have llama packs which are pre-built code snipets that you can pull into your application and turn big chunks of code of fiddly code into one liners like a complex query strategy or even an entire application core like a chatbot we also have a one-step solution to get all of this stuff into production it's a command line tool called create llama everything we do is called llama something sorry create llama is based on the idea of create react app it puts together a full stack llama index application for you ready to host uh on a deployment Target like for sell or render so now let's build an app with multimodal retrieval you can follow along in the linked notebook I'm going to skip the parts of the notebook where we just install dependencies and set up our API key and fetch our test data let's get to the first important part where we load in some images the images we're giving it giving to it this time are this set of images and cars and a bunch of text about those cars it's a mixture of Toyotas Volkswagens and Teslas loading images in Lama index is exactly the same as loading text in fact it uses exactly the same loader simple directory reader for a production use case you could use a loader from llama Hub like our AWS S3 loader all we're doing here is passing the simple directory reader a list of exactly three images to load I happen to know that these images are one Tesla One Toyota and one Volkswagen but the llm doesn't know that so let's fire up the llm and ask it a question about these images so we instantiate our llm we're using open ai's GPT 4V but we support several others in Lama index including lava fuu mini GPT and Cog vlm all via replicate we now do the simplest thing possible we just ask GPT to describe what it's looking at which it does a bang up job at distinguishing between three cars that definitely could not tell apart in real life now let's do something slightly less simple and index text and images and then retrieve them as I mentioned earlier we need both a text store and an image store to make a multimodal index so let's make both of those here and then pass them to the storage context so that we can use them later now once again we load all of the data from local disk this works whether it's text or images or both uh and we instantiate our multimodal index which requires the storage context that we created in the last step now we instantiate a retriever we give it separate parameters for how many things to retrieve from the text and image indexes so it finds three text nodes about Toyotas and three images of Toyotas which is exactly what we asked it to do in the notebook I then ask it for Tesla's instead to prove that it's not just fetching Toyotas by default and it correctly does that as well finally let's query this multimodal index we've just created we instantiate a query engine and we pass it the same parameters that we would have passed the retriever about how many things to fetch and we ask it to compare the Toyotas so first the retriever finds the Toyotas and then the llm gets the pictures of Toyotas as context and answers the question about them this is cool we've done all six stages of a r rag application here we loaded we indexed we stored we retrieved we prompted and we synthesized now let's go one step further and instead of finding images with text let's find an image with another image this one's a new notebook that you can also follow along in as usual we'll skip the basic setup steps the first complicated thing we do is fire up a Wikipedia client which downloads a bunch of images and text from several unrelated Wikipedia Pages such as Vincent Van Go San Francisco and Batman we get a nice diverse set of images and text we create a multimodal index exactly the same way we did in the last notebook an image store a text store load all the documents into a multimodal index give it to the storage context uh and now we instantiate the retriever just the same way we did before but now we call a new method that we haven't seen before imageo image retrieve as input we'll pass it a single image the image we're using to search is a picture of star KN by Van go that we've downloaded separately and isn't in the set that we're searching what we get back is four van go paintings the embeddings of all the van go paintings are sufficiently semantically similar that the retriever finds them next to each other in Vector space which I think is really NE so that's imageo image retrieval now let's do querying using an image's input this is a little bit more advanced than basic text quaring we have to create a custom prompt and then intend a query Engine with that prompt as well as our usual things like a multimodal llm to use and parameters about how many text chunks and images to fetch now we call another new method image query which takes our query and an image file as input again the input image we're using is star night and so the results are an analysis of Van go paintings the retrieval step has found painting similar to Star night just like it did when we were doing retrieval and it has passed the resulting pictures as context to the llm which is synthesized a response about post impressionism again this is all six stages of a rag application we loaded a set of random images and text from Wikipedia we indexed them against embedding models we stored the results in text and image Vector stores we retrieved images matching our queries we prompted the llm what to do and with the images and our query uh and it synthesized a response so today we've covered why the retrieval in retrieval augmentation Works how it works in a multimodal case how llama index helps you build multimodal applications and several examples of multimodal use cases using llama index to get them done I hope this helped you out on your Learning Journey in AI if you want to dive deeper docs. Lama index. is the place to go and I hope to see you again soon thanks for your time

Original Description

In this deep dive we'll show you how to build production RAG applications using LlamaIndex's multi-modal capabilities, including * How RAG works * What LlamaIndex, LlamaHub and create-llama are * How to do basic image querying, multi-modal retrieval, multi-modal querying, image-to-image retrieval and image-to-text querying Linked notebooks: Multi-modal retrieval and querying: https://bit.ly/multi-modal-1 Image-to-image retrieval and querying: https://bit.ly/multi-modal-2 Learn more at https://docs.llamaindex.ai/
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from LlamaIndex · LlamaIndex · 42 of 60

1 LlamaIndex Virtual Meetup (May 4th, 2023)
LlamaIndex Virtual Meetup (May 4th, 2023)
LlamaIndex
2 LlamaIndex + MongoDB Workshop/Fireside Chat
LlamaIndex + MongoDB Workshop/Fireside Chat
LlamaIndex
3 Discover LlamaIndex: Ask Complex Queries over Multiple Documents
Discover LlamaIndex: Ask Complex Queries over Multiple Documents
LlamaIndex
4 Discover LlamaIndex: Document Management
Discover LlamaIndex: Document Management
LlamaIndex
5 Discover LlamaIndex: Joint Text to SQL and Semantic Search
Discover LlamaIndex: Joint Text to SQL and Semantic Search
LlamaIndex
6 Discover LlamaIndex: JSON Query Engine
Discover LlamaIndex: JSON Query Engine
LlamaIndex
7 LlamaIndex Webinar: Active Retrieval Augmented Generation
LlamaIndex Webinar: Active Retrieval Augmented Generation
LlamaIndex
8 LlamaIndex Webinar: Demonstrate-Search-Predict (DSP) with Omar Khattab
LlamaIndex Webinar: Demonstrate-Search-Predict (DSP) with Omar Khattab
LlamaIndex
9 LlamaIndex Sessions: Practical challenges of building a Legal Chatbot over your PDFs
LlamaIndex Sessions: Practical challenges of building a Legal Chatbot over your PDFs
LlamaIndex
10 LlamaIndex Webinar: Graph Databases, Knowledge Graphs, and RAG with Wey (NebulaGraph)
LlamaIndex Webinar: Graph Databases, Knowledge Graphs, and RAG with Wey (NebulaGraph)
LlamaIndex
11 LlamaIndex Webinar: Community Project Showcase (07/07/2023)
LlamaIndex Webinar: Community Project Showcase (07/07/2023)
LlamaIndex
12 LlamaIndex Webinar: LLMs for Investment Research (with Didier Lopes, co-founder/CEO at OpenBB)
LlamaIndex Webinar: LLMs for Investment Research (with Didier Lopes, co-founder/CEO at OpenBB)
LlamaIndex
13 Discover LlamaIndex: Bottoms-Up Development With LLMs (Part 1, LLMs and Prompts)
Discover LlamaIndex: Bottoms-Up Development With LLMs (Part 1, LLMs and Prompts)
LlamaIndex
14 Discover LlamaIndex: Bottoms-Up Development With LLMs (Part 2, Documents and Metadata)
Discover LlamaIndex: Bottoms-Up Development With LLMs (Part 2, Documents and Metadata)
LlamaIndex
15 Discover LlamaIndex: Key Components to build QA Systems
Discover LlamaIndex: Key Components to build QA Systems
LlamaIndex
16 Discover LlamaIndex: Bottoms-Up Development with LLMs (Part 3, Evaluation)
Discover LlamaIndex: Bottoms-Up Development with LLMs (Part 3, Evaluation)
LlamaIndex
17 LlamaIndex Webinar: From Prompt to Schema Engineering with Pydantic  (with @jxnlco)
LlamaIndex Webinar: From Prompt to Schema Engineering with Pydantic (with @jxnlco)
LlamaIndex
18 Discover LlamaIndex: Bottoms-Up Development with LLMs (Part 4, Embeddings)
Discover LlamaIndex: Bottoms-Up Development with LLMs (Part 4, Embeddings)
LlamaIndex
19 Discover LlamaIndex: Custom Retrievers + Hybrid Search
Discover LlamaIndex: Custom Retrievers + Hybrid Search
LlamaIndex
20 LlamaIndex Webinar: Document Metadata and Local Models for Better, Faster Retrieval
LlamaIndex Webinar: Document Metadata and Local Models for Better, Faster Retrieval
LlamaIndex
21 LlamaIndex Webinar: Build Personalized AI Characters with RealChar
LlamaIndex Webinar: Build Personalized AI Characters with RealChar
LlamaIndex
22 LlamaIndex Webinar: Make RAG Production-Ready
LlamaIndex Webinar: Make RAG Production-Ready
LlamaIndex
23 LlamaIndex Workshop: Building RAG with Knowledge Graphs
LlamaIndex Workshop: Building RAG with Knowledge Graphs
LlamaIndex
24 Discover LlamaIndex: Introduction to Data Agents for Developers
Discover LlamaIndex: Introduction to Data Agents for Developers
LlamaIndex
25 LlamaIndex Webinar: Finetuning + RAG
LlamaIndex Webinar: Finetuning + RAG
LlamaIndex
26 Discover LlamaIndex: SEC Insights, End-to-End Guide
Discover LlamaIndex: SEC Insights, End-to-End Guide
LlamaIndex
27 Discover LlamaIndex: Custom Tools for Data Agents
Discover LlamaIndex: Custom Tools for Data Agents
LlamaIndex
28 LlamaIndex Sessions: Building a Lending Criteria Chatbot in Production
LlamaIndex Sessions: Building a Lending Criteria Chatbot in Production
LlamaIndex
29 Discover LlamaIndex: Bottoms-Up Development with LLMs (Part 5, Retrievers + Node Postprocessors)
Discover LlamaIndex: Bottoms-Up Development with LLMs (Part 5, Retrievers + Node Postprocessors)
LlamaIndex
30 LlamaIndex Webinar: How to Win a LLM Hackathon
LlamaIndex Webinar: How to Win a LLM Hackathon
LlamaIndex
31 LlamaIndex Webinar: LLM Challenges in Production (w/ Mayo Oshin, AI Jason, Dylan from Starmorph)
LlamaIndex Webinar: LLM Challenges in Production (w/ Mayo Oshin, AI Jason, Dylan from Starmorph)
LlamaIndex
32 LlamaIndex Webinar: Agents Showcase!
LlamaIndex Webinar: Agents Showcase!
LlamaIndex
33 LlamaIndex Webinar: Learn about DSPy
LlamaIndex Webinar: Learn about DSPy
LlamaIndex
34 LlamaIndex Webinar: Time-based retrieval for RAG (with Timescale)
LlamaIndex Webinar: Time-based retrieval for RAG (with Timescale)
LlamaIndex
35 LlamaIndex Webinar: Build/Break/Test LLM Apps Showcase (co-hosted with TrueEra, Pinecone)
LlamaIndex Webinar: Build/Break/Test LLM Apps Showcase (co-hosted with TrueEra, Pinecone)
LlamaIndex
36 LlamaIndex Workshop: Evaluation-Driven Development (EDD)
LlamaIndex Workshop: Evaluation-Driven Development (EDD)
LlamaIndex
37 LlamaIndex Webinar: Building LLM Apps for Production, Part 1 (co-hosted with Anyscale)
LlamaIndex Webinar: Building LLM Apps for Production, Part 1 (co-hosted with Anyscale)
LlamaIndex
38 LlamaIndex Webinar: Learn about Fine-tuning + RAG (w/ Victoria Lin, author of RA-DIT)
LlamaIndex Webinar: Learn about Fine-tuning + RAG (w/ Victoria Lin, author of RA-DIT)
LlamaIndex
39 LlamaIndex Webinar: What's next for AI after OpenAI Dev Day?
LlamaIndex Webinar: What's next for AI after OpenAI Dev Day?
LlamaIndex
40 Introducing create-llama
Introducing create-llama
LlamaIndex
41 LlamaIndex Webinar: PrivateGPT - Production RAG with Local Models
LlamaIndex Webinar: PrivateGPT - Production RAG with Local Models
LlamaIndex
Multi-modal Retrieval Augmented Generation with LlamaIndex
Multi-modal Retrieval Augmented Generation with LlamaIndex
LlamaIndex
43 LlamaIndex Webinar: LLaVa Deep Dive
LlamaIndex Webinar: LLaVa Deep Dive
LlamaIndex
44 A deep dive into Retrieval-Augmented Generation with Llamaindex
A deep dive into Retrieval-Augmented Generation with Llamaindex
LlamaIndex
45 LlamaIndex Workshop: Multimodal + Advanced RAG Workhop with Gemini
LlamaIndex Workshop: Multimodal + Advanced RAG Workhop with Gemini
LlamaIndex
46 LlamaIndex Webinar: Efficient Parallel Function Calling Agents with LLMCompiler
LlamaIndex Webinar: Efficient Parallel Function Calling Agents with LLMCompiler
LlamaIndex
47 Introduction to Query Pipelines (Building Advanced RAG, Part 1)
Introduction to Query Pipelines (Building Advanced RAG, Part 1)
LlamaIndex
48 LLMs for Advanced Question-Answering over Tabular/CSV/SQL Data (Building Advanced RAG, Part 2)
LLMs for Advanced Question-Answering over Tabular/CSV/SQL Data (Building Advanced RAG, Part 2)
LlamaIndex
49 LlamaIndex Webinar: Advanced Tabular Data Understanding with LLMs
LlamaIndex Webinar: Advanced Tabular Data Understanding with LLMs
LlamaIndex
50 Ollama X LlamaIndex Multi-Modal
Ollama X LlamaIndex Multi-Modal
LlamaIndex
51 Build Agents from Scratch (Building Advanced RAG, Part 3)
Build Agents from Scratch (Building Advanced RAG, Part 3)
LlamaIndex
52 LlamaIndex Webinar: Build No-Code RAG with Flowise
LlamaIndex Webinar: Build No-Code RAG with Flowise
LlamaIndex
53 LlamaIndex Sessions: Practical Tips and Tricks for Productionizing RAG (feat. Sisil @ Jasper)
LlamaIndex Sessions: Practical Tips and Tricks for Productionizing RAG (feat. Sisil @ Jasper)
LlamaIndex
54 Introduction to LlamaIndex v0.10
Introduction to LlamaIndex v0.10
LlamaIndex
55 Build SELF-DISCOVER from Scratch with LlamaIndex
Build SELF-DISCOVER from Scratch with LlamaIndex
LlamaIndex
56 Introducing LlamaCloud (and LlamaParse)
Introducing LlamaCloud (and LlamaParse)
LlamaIndex
57 LlamaIndex Sessions: 12 RAG Pain Points and Solutions
LlamaIndex Sessions: 12 RAG Pain Points and Solutions
LlamaIndex
58 LlamaIndex Webinar: RAG Beyond Basic Chatbots
LlamaIndex Webinar: RAG Beyond Basic Chatbots
LlamaIndex
59 A Comprehensive Cookbook for Claude 3
A Comprehensive Cookbook for Claude 3
LlamaIndex
60 LlamaIndex Webinar: RAPTOR - Tree-Structured Indexing and Retrieval
LlamaIndex Webinar: RAPTOR - Tree-Structured Indexing and Retrieval
LlamaIndex

This video teaches how to build production-ready RAG applications using LlamaIndex's multi-modal capabilities. It covers the basics of RAG, LlamaIndex, and its components, as well as how to perform basic image querying, multi-modal retrieval, and image-to-image retrieval. By the end of this video, viewers will be able to build and deploy their own RAG applications using LlamaIndex.

Key Takeaways
  1. Load images and text from local disk
  2. Index text and images in separate stores
  3. Instantiate a retriever with parameters for text and image indexes
  4. Ask the retriever to find text and images
  5. Query the multimodal index using a query engine
  6. Fire up a Wikipedia client to download images and text
  7. Create a multimodal index with an image store and text store
  8. Instantiate the retriever with the imageo image retrieve method
  9. Create a custom prompt for querying using an image's input
  10. Call the image query method with the query and an image file as input
💡 LlamaIndex provides a framework for building production-ready RAG applications with multi-modal capabilities, allowing developers to easily integrate text, images, and audio into their applications.

Related AI Lessons

Your AI Keeps Making Things Up. RAG Is How You Make It Use Real Facts Instead.
Learn how to use RAG to make your AI provide accurate answers based on real facts instead of making things up
Medium · RAG
Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…
Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality
Medium · AI
Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…
Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality
Medium · Data Science
When Does HyDE Help RAG? I Tested 3 Query Types and It Failed on Two
Learn when HyDE retrieval helps or hinders RAG performance across different query types, and why it matters for improving search accuracy
Medium · AI
Up next
RRF vs DBSF with Qdrant: Hybrid Retrieval Fusion for RAG in Python
Professor Py: AI Engineering
Watch →