Multimodal RAG: Chat with PDFs (Images & Tables) [2025]

Alejandro AO · Beginner · 🧠 Large Language Models · 1y ago

Skills: Multimodal LLMs 90% · RAG Basics 80% · Prompt Craft 70% · Vector Stores 70% · Fine-tuning LLMs 60%

Key Takeaways

This video tutorial demonstrates how to build a multimodal Retrieval-Augmented Generation (RAG) pipeline using LangChain and the Unstructured library, enabling AI-powered systems to query complex documents such as PDFs containing text, images, tables, and plots. The tutorial covers tools and techniques including GPT-4o mini, ChromaDB, and Groq, which are used to build a multi-vector store and extract structured data from unstructured documents.
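
The multi-vector idea summarized above can be sketched in plain Python. This is a minimal illustration, not the tutorial's code: the names SummaryIndex, build_stores, and retrieve are hypothetical, the sample data is made up, and simple word overlap stands in for a real embeddings model and a vector database such as ChromaDB.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SummaryIndex:
    """Toy stand-in for a vector store: it holds (summary, doc_id) pairs only."""
    entries: list = field(default_factory=list)

    def add(self, summary: str, doc_id: str) -> None:
        self.entries.append((summary, doc_id))

    def search(self, query: str, k: int = 1) -> list:
        # Word overlap replaces embedding similarity here; this is an
        # illustrative simplification, not what a real vector store does.
        query_words = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda entry: len(query_words & set(entry[0].lower().split())),
            reverse=True,
        )
        return [doc_id for _, doc_id in scored[:k]]


def build_stores(elements_with_summaries):
    """Link each original element to its summary through a shared doc_id."""
    index = SummaryIndex()
    docstore = {}  # doc_id -> original element (text, table HTML, or base64 image)
    for original, summary in elements_with_summaries:
        doc_id = str(uuid.uuid4())
        index.add(summary, doc_id)
        docstore[doc_id] = original
    return index, docstore


def retrieve(query, index, docstore, k=1):
    """Search the summaries, then return the originals they point to."""
    return [docstore[doc_id] for doc_id in index.search(query, k)]


# Hypothetical sample data: (original element, summary) pairs.
sample = [
    ({"type": "image", "payload": "<base64 bytes>"}, "diagram of multi-head attention layers"),
    ({"type": "table", "payload": "<table>...</table>"}, "table comparing BLEU scores and training cost"),
    ({"type": "text", "payload": "The dominant sequence..."}, "abstract introducing the transformer architecture"),
]
index, docstore = build_stores(sample)
results = retrieve("what is multi-head attention", index, docstore)
```

In the pipeline described in the video, the summary index would be a vector database and the lookup table a separate document store; the doc ID link between them works the same way.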

Full Transcript

Good morning everyone, how's it going today? Welcome back to the channel and welcome to this new video. In today's video I want to show you how to chat with a PDF and take into account the images, the tables, the plots, and everything else that can be in your document for the generation of the response. It's going to look something like this: in essence, you're going to be querying your pipeline. Here I have a sample with the "Attention Is All You Need" paper from Google, and I queried, "What do the authors mean by attention?" As you can see, the retrieved part of the document was this part right here, "Attention", and we also have the images that were retrieved. This is everything that is going to be sent to the language model, including the images, so that the language model can give us a response based on it. Everything in here goes into the language model as context, and it gives us an answer.

In order to do this, we're going to explain the whole process with this very nice diagram that you see right here. We're going to be using Unstructured for parsing our document into images, tables, and text, and we're of course going to be using a language model that has multimodal input, that is to say, a language model with vision. In this case we're going to be using GPT-4o mini to interpret the images, the tables, and the text. As you can see, we're going to be creating this very sophisticated multi-vector store using LangChain. Very cool, very convenient, and it's actually easier than you may think. We're going to go through the entire process in this notebook, and the notebook is of course available in the description. If you have any questions, don't hesitate to ask me. And if you don't understand what RAG is, because you probably want to know how to do RAG, or retrieval-augmented
generation, without the images, only with text, you should probably watch that video first, which is also in the description of this video, and then take a look at multimodal RAG. So there you go. Without any further ado, let's get right into the video.

[Music]

All right, so let's explain a little bit about the method we're going to be introducing here. As I mentioned before, this is not the only method, and in this particular video I am only going to explain the process. We're going to cover the code for how to do this in the next lesson; for now let's just cover the process that's going on in this pipeline. The process is actually very well represented in this diagram right here.

We're going to be using this library that you can see right here: Unstructured. Unstructured is an open-source library that allows you to extract structured data from your unstructured documents. In other words, it takes your unstructured and semi-structured data, which can come from PDF files, HTML files like your websites, a CSV file, an Excel file, basically pretty much any file format that you can think of as long as it is unstructured or semi-structured, and splits it into different components. So you take your PDF document, you pass it through the Unstructured library, and we get one array with all of the images from your entire document, another array with all of your tables, and another array with all of the text. This is very convenient because it allows us to treat those types of elements differently, and to embed them and load them to our database differently depending on what they actually are
and what they need in order to be loaded, and what transformation we need to do in order to use them in our RAG pipeline. So that's the first step, the extraction. That is probably the most difficult thing to do, and the most magical one, because Unstructured is just so amazing for this.

Once you have extracted this, we're going to use a language model to summarize the elements, so we attach a summary to every single element that we have right here. What I did in this particular example is I used a regular language model for the tables and the text. Of course, the text and the tables are regular text representations; Unstructured actually allows you to extract the HTML representation of a table that is within your PDF. So what we're going to do is send the HTML representation of all of our tables to a very quick language model, in this case a language model from Groq (I think I was using Llama 3.1), and create a summary for each table. Then we do the same thing for every long piece of text: we create a summary for that piece of text, which allows us to embed the summary instead of embedding the entire text. This helps with retrieval; usually that's a good technique, because it allows you to focus on the keywords that are actually relevant to the text that you're embedding.

For the images we do the same thing: we summarize, or more than summarize, describe the image. In order to do that we use a language model that has multimodal capabilities. In this example I was using GPT-4o mini, but feel free to use any language model that has multimodal input. You can use, for example, Gemini 1.5 from Google, or you can use LLaVA, for
example, if you want an open-source model. But just to be clear, up until this part right here we have already extracted all of the images, all of the tables, and all of the texts, and we have also tagged them with a summary. It is the summary that we're going to embed using the embeddings model. So that is the next step.

Once you have extracted every single thing, we tie the summary and the original element together using a doc ID. This is just a string with a unique, very specific ID that links the original document to its summary. So we have all of our summaries linked: we create documents from the summaries, and in the metadata of each one of those documents we put the ID. In the same way, we create an element for each original, and the metadata of each one of these elements has the same doc ID as its summary.

Then these summaries are the ones that go into our vector store, and this part is pretty much the same thing that we have been doing so far: we vectorize them using a text embeddings model, then we load them into a vector database. In this case, if I remember correctly, I was using ChromaDB. The documents themselves are not going to go into the vector store, because we are not embedding them. As I told you before, that could be a possibility, and it is actually one way of doing this: if you want, you could embed the whole thing using a multimodal embeddings model. But in our case we're not using a multimodal embeddings model; we're only using a text embeddings model, and we only vectorize the summaries. So we put the summaries into a vector database, and we load the documents into a different database that we're going to call our document store. But remember that
even if they are in two different databases, they are still linked by this doc ID metadata that we have assigned to them. This is very relevant, because now retrieval becomes possible: you can query your vector database like you would query a regular RAG pipeline, and the vector database returns the most relevant documents for your query. For example, let's say that you're loading the research paper "Attention Is All You Need" from Google, and you're querying something like, "What is multi-head attention?" That retrieves the summaries of the documents that talk about multi-head attention. Those are the retrieved documents, but remember that our vector database only contains the summaries, not the documents themselves. That is why the doc ID metadata assigned to them is so important: this doc ID is the one that we use to fetch, from the document store, the documents that we're actually looking for. Then we fetch those documents, and since those documents could be images, tables, and text, we actually get back not only text but also images and tables.

That is essentially what we're going to be doing, and how multimodal retrieval works in this particular way of doing it. By the end, as you can see, we have a very simple retrieval pipeline in which we just ask a question, send a query in text, and in return we get documents that can be images, tables, whatever we loaded to our store and were able to tag with a summary. Then we use that to generate an answer. If we get images as context, then we're probably going to have to use a language model that has
multimodal input capabilities, if you want to use the images as context, which I suppose is what we want to do. So that is the whole process that we're going to be building. In the next lesson we're going to show the code for how all of this works.

All right, so let's actually start with the code right now. The first thing that I want you to do is install the system packages that we need in order to run the dependencies. It depends on what OS you're running, but you're going to need poppler, tesseract, and libmagic. Here are the quick instructions to install them on Mac or Linux. If you're using Windows, there are also instructions to build them; I'm probably going to post them under this video. Just get those installed. I already have them installed.

The next thing to do is install the Python dependencies. By the way, this notebook is of course available in this lesson; right under this video there is the link to open it, and you will have access to the entire code, so that you can run it on your end and implement it in your own pipeline. As I was telling you, we're going to install unstructured, and we're also installing Pillow and lxml (I installed Pillow twice, because why not). In my case I'm going to be using ChromaDB. We need tiktoken for the tokenization. We're going to use LangChain, LangChain Community, and LangChain OpenAI, because my vision model is going to be GPT-4o mini, if I remember correctly. For the text-only language models I'm going to use Llama 3, if I'm not mistaken, and python-dotenv for my environment variables.

I'm of course going to be initializing a Groq API key to use open-source LLMs from the Groq API, and the OpenAI API key, which you can get from the platform dashboard. I also set the LangChain tracing flag and the LangChain API key, which are what we use for LangSmith, because I want to be able to trace what's going on behind the scenes in this pipeline. Once that is run... you know, I'm actually going to restart the kernel, because I realized that I had already run the whole thing, so I'm on run 101 right now. Just to make sure that we're all working with a similar environment, I'm going to rerun the whole thing.

Once you have installed everything, the first thing we do is partition our PDF. In our case we're dealing with the "Attention Is All You Need" PDF, which we have right here. This is just an example, of course; feel free to use any PDF you want. I figured that this was a good example because, even if it's a bit short, just 15 pages, it has some images, a lot of text, and some equations as well, if I remember correctly. It's a multimodal document that is more similar to the PDFs that you would see in real life. Here we have a table as well; we're going to see how Unstructured and our pipeline extract it. This is the file that we're going to be loading.

So the first thing we do is partition it, and to partition it we use Unstructured. Let's get into how we're going to use Unstructured and what each parameter right here does. And before I forget to mention this, this video is actually
a lesson from the AI engineering cohort that I host, where I teach you how to go from beginner to this level of creating multimodal RAG applications, and from this level all the way to creating multi-agent systems. This is a program that not only includes pre-recorded material with all of the content, but also includes my personal help: if you get stuck, I can be there to answer your questions. There are live sessions; it is a cohort course, which means that you will be interacting with me live. So be sure to join if you're interested in that and in joining the community. It is pretty fun, and I can't wait to meet you there. Just let me know if you have any questions about that too. Okay, so on with the video.

So what do I have right here? I'm calling this function from Unstructured. Remember that we have already installed Unstructured, and I actually installed support for all kinds of documents. In this case I'm only using PDF, so I could technically have installed just the PDF packages, but you're going to want to install all-docs if you're going to be parsing a lot of different formats. In this case I'm using the partition_pdf function from Unstructured.

Something that might be important to mention is that I am not using the loader from LangChain here. LangChain actually has a loader for Unstructured, and it works pretty well; you can use it both locally and with the serverless API. But I wanted to show you a little bit more of the flexibility that you have with Unstructured, so that we can see how all of these parameters work with the Unstructured API and you can see what's actually going on under the hood. So partition_pdf, and pretty much any partitioning method from
Unstructured basically takes the path to the file that you want to partition and returns the partitioned elements. That's the only mandatory parameter, but there are other arguments available, so let's take a look at them.

First of all we have the strategy, which can be hi_res or a normal one. In this case we're choosing hi_res because we're setting this parameter, infer_table_structure, to True. That parameter essentially says whether you want to extract tables from your document or not. If you want to extract tables, you have to set it to True, and the high-resolution strategy is then mandatory. In our case we do want to extract tables, so we choose these settings.

Something else is that you're going to want to choose the kind of images that you want to extract. What I have seen in previous examples of this tutorial, because I am basing this on a cookbook from LangChain, is that they used this parameter right here, extract_images_in_pdf, set to True. However, that parameter is deprecated, or on the way to being deprecated, so you don't need to add it anymore. I just added it here for context, because you're probably going to find it in the wild or in other tutorials, and I don't want you to get confused about it. The updated way of doing this is with this parameter right here, extract_image_block_types, and you set it to Image if you want to extract the images from your PDF. If you want to extract the
tables too, for example, you add Table as well, like this. In our case we only want to extract the images. This doesn't mean the tables won't be extracted; they just won't be extracted as images. It's still going to extract the tables, as I'm going to show you in a minute.

If you want to extract the images to an actual directory on your computer, you can enable this and pass in an output path to save the images to. That's also a possibility. In my case I'm not going to enable it, because I don't want the images from my PDF saved to my computer; I just want them in the partitioned elements. But feel free to enable this and play around with it, and you will see that it creates a new folder right here with all your images.

The next thing we have is extract_image_block_to_payload. This essentially means that each extracted image is going to have a metadata field that contains the base64 representation of the image. If you set this to False, you don't get the base64 representation, which would be terrible, because you need it if you want to send the images to your language model: to send an image to the API, you have to send it using a base64 representation. So this is the way to do it.

This next part is actually very interesting, and I'm going to show you how this works without it first, because it's super powerful and super cool. So without it, I'm going to run this. It's probably going to take a few seconds, but what is going to happen right now is that it
is going to extract every single element from my PDF. It's going to return the table, this paragraph, this other paragraph, this title right here, this other table; all the elements of my entire document are returned in a single array. That's okay, that could be what you want. But Unstructured actually allows you to do something super cool, which is to chunk it by a strategy that you can choose: by_title or basic. You might be used to thinking of chunking as making things smaller, but when you apply a chunking strategy in Unstructured, it means you're putting elements together.

Right now, without the chunking strategy, if we look at what kinds of elements were returned to us, you can see that we have Title elements, NarrativeText elements, Footer elements, Image elements, etc. All of these are the ones available to us. That's great, we have all the elements in a single array, but you probably don't want that. Let me show you what this looks like: if I take the length of chunks, you see that we have 218 different elements. So it split the entire PDF document into 218 elements: this might be one, this might be another one, this title might be another one, the table might be another one. You don't really want that. What you want is to have the elements that are related to each other together, and that's where the chunking strategy comes into play. I will let you play around with this to see what it actually does, but if we enable chunking
going to rerun this, but with chunking this time. I'm going to set the chunking strategy to by_title, with a maximum chunk size of 10,000 characters; we combine elements when they are under 2,000 characters, and we start a new chunk after 6,000 characters. You can take a look at the documentation if you want to delve into this more deeply. What this essentially does is take the 218 elements that we have in the document and put together those that are related. If you choose by_title, it goes through our document and says: okay, this is one title, so all of the elements under this title are assigned to a single chunk (that's what they call it), and then all of the elements under this other title go into a single chunk as well.

This is super useful for RAG, because in a document like this one, a titled chapter talks about one single topic; it has a cohesive meaning. So you're able to embed an entire chunk that is related. It's basically extracting the document chapter by chapter, which is pretty useful.

It actually finished, and you will see that in this case we don't have 218 elements anymore; we only have 17. In the same way, we don't have all of the different types; we only have these two types, CompositeElement and Table. Let me show you what a CompositeElement looks like. From chunks, I'm going to look at the first one. The first one is a CompositeElement; I'm going to call to_dict(), and you can see
right here the fields of this composite element: it has a type of CompositeElement, an ID, some text, and some metadata that says which page number it comes from, etc. That's very cool, but interestingly it has this property inside of its metadata. Let me show you: metadata, and then this property right here, which is orig_elements. That one contains all of the elements that are associated with this particular chunk.

So remember that we had 218 elements inside the entire PDF. For 15 pages, Unstructured extracted 218 elements, and then it associated them using the by_title chunking strategy. The 17 chunks that it returned are actually sets of these components, and since we used by_title, they are supposed to come one after another under the same section of the PDF.

I'm going to show you right now how this looks in the actual PDF. I was just running some tests, but here I'm displaying one chunk. As you can see, this chunk starts at this title called "Attention" and goes all the way down to here, so it is kind of a chapter in the document. It contains 1, 2, 3, ... 13 elements, and they are all associated with the same title. That is because we used the chunking strategy, and it is also the reason why only CompositeElement and Table objects came back: the composite elements are the ones that contain all of the other elements inside of their metadata, and they sit under
the orig_elements key. So far so good: we have successfully extracted our documents. I'm going to erase this one because I had already shown it to you right here, but as you can see, inside of the metadata, under the orig_elements key, we have a title, narrative text, a footer, and for this one in particular a couple of images.

Let me show you what the images look like inside of an Unstructured element. In this cell, what I'm doing is extracting only the elements that are of type Image. I listed all of the elements in chunk three; here we have a title, narrative text, etc. So I'm taking all of the elements from chunk three, extracting the images, selecting the first one, and converting it to a dictionary. Here you can see the representation of an image that was extracted by Unstructured. It has a type of Image; it has some text, because Unstructured is able to extract the text within the image; it has the coordinates within the document itself, which is going to be very useful afterwards if you want to highlight where in the document this particular element is located; and then, very importantly, we have the image's base64 representation. As you can see, it's super long, but that's exactly what we want. We are only getting this because we set extract_image_block_to_payload to True; that is the only reason we get this key. It is of course very important, because this is what we're going to send to our multimodal language model.

So far so good: we have successfully extracted the elements, and we can
actually now split them. By the end of this splitting step we're going to have three different arrays: one of tables, one of texts, and one of images, just like in the diagram that we showed before. So, for chunk in chunks, that is, for the 17 chunks that we extracted, we append the element to tables if it's a Table, and we append it to texts if it's a CompositeElement. This is technically a shortcut, because remember that inside a CompositeElement there are also images, but we're going to treat the images separately. Of course you can improve on this pipeline if you want and handle the images within the composite element more carefully; feel free to do that. In this case I'm just extracting the images and the composite elements and treating them as two different kinds of elements.

Third, I extract the images, and I extract them from the composite elements: I tap into every single composite element, and if it contains an Image element, I add it to my images array. That way, by the end I have three different arrays, one for tables, one for texts, and one for images, and we have successfully completed the partitioning, or extraction, part. Now we have to transform it.

Here I just have a very quick function, made with ChatGPT, that takes the base64 code of an image and displays it. Here I'm showing the first element of the images array that we created, and as you can see, this is the first image that was extracted from the document. It seems to be working pretty well. So now we have these three arrays, and it is time to go to the next part of
this exercise, which is summarizing the data that we just extracted. We're going to create a summary for each image, each table, and each piece of text.

All right, so now it is time to start summarizing the data. To summarize it I'm going to use, first of all, a model from Groq, Llama 3.1 if I remember correctly. For that I'm of course installing the Groq package, and I'm importing ChatGroq from it, along with ChatPromptTemplate and my regular StrOutputParser to create a chain. The chain that generates the summaries for my text elements is this one right here: "You are an assistant tasked with summarizing the tables...", etc. It's a very simple chain: the input goes into the prompt, then the model, then the output parser, and as you can see I'm initializing Llama 3.1 from ChatGroq. I'm going to run this. I actually think I forgot to execute this cell right here. There we go.

Now let me show you what texts looks like. Remember that we split all of our elements into tables, texts, and images. texts is a collection of composite elements. What does that mean? Remember I told you that in each one's metadata there are the original elements. I have to tap into the first one to show you: all of the elements are within orig_elements, like this. However, you can still print the chunk itself, like this, and it shows you the text of all the elements in that chunk. Here I have my scrollable element, and as you can see, here is all the text for this first chunk. It's just the title and the abstract; that's the first
chunk. What we're going to do now is use that in order to summarize it. We pass in every single one of those texts, every single one of those composite elements, and we summarize them. I'm going to show you what it looks like on LangSmith a little bit later on if you want, but the idea here is that it takes the entire contents of all of the elements within the composite element and batch-executes the summarize chain, and that goes to the summaries. The same thing happens with the tables.

However, let me show you something quick about the tables. The tables basically look like this. We have four tables in the document, and let me show you what one looks like: to_dict. Actually, I have to click on the first one right here. You can see this is the first table that we have. We have the element ID, we have some text within it, and then it has this very convenient property, which is text_as_html. That means it is the extracted table, but in HTML format, and this is the only thing that we actually need to send to our language model in order to summarize it. We don't really need its original elements, because if we try to tap into the original elements, for example, let's see, metadata, orig_elements, let's see what it looks like, to_dict. Doesn't look like anything. No, there's no to_dict here. It's a [unclear], apparently. I don't know why I have a base64 thing right here, but what I wanted to show you is that we have the HTML code inside of here, and this is the one that we're going to send to our language model to actually get the summary. Because remember, if you have a language model you have to send it text, or an image if it's multimodal, but you cannot send it just the text like this; it's probably not
going to understand the divisions. There are no headings or anything right here. You want to send it the actual markup language so that it can understand where the header is, what a table cell is, etc. So these are the ones that we're going to be tapping into, and that's what's actually going on right here: I say that the tables HTML is the text_as_html property of each table in the tables array, and then I batch that too. Let me just execute this; it takes a few seconds.

Let me show you what it looks like. I have it right here. That's where the text summaries are: as you can see we have one, two, three, a bunch of text summaries, one for each of the composite elements. Here you have every single summary. Then let's check the same thing for the tables: table summaries. Here we have four tables: "The table compares four types of neural network layer: self-attention, recurrent", etc. As you can see, these are the summaries that I am going to be vectorizing, embedding, and adding to my vector database.

Now let's do the same thing for our images. To do that we're going to be using OpenAI, of course, so the first thing to do is to install it. Here we go. Similarly to the previous examples, we also create a chain that summarizes the image, but in this case we're not sending a regular prompt like we did before; we're loading the message with the image itself. You can check the API documentation of whichever LLM you're using to see how you can send an image to it. In LangChain it is pretty uniform: you send a user message, you send whatever you want to send as text within the dictionary with type "text", and then you create another dictionary within it with type "image_url", and then you send it in base
64, so that it understands that image, and that's essentially how we're going to be sending the image. As you can see, here we have a prompt template that takes only one variable, which is image, and here we have another template that is not taking any variable, so that's convenient. This part right here is going to be the base64 code of the image that we want to convert. Then we just initialize our chain. In this case we're using GPT-4o mini, because we want a language model that has multimodal input, and then we just batch the summaries. Let's execute that. It actually takes a little bit of time when it is ingesting all of those images. I think we have one, two, three, four, five, six, seven images right here, so let's see how long it takes. It takes 16 seconds to process all those 16 images.

Now let's see the summaries. Here we have all the summaries. Let's print the first, the third, the fourth one. There we have it: "The image appears to illustrate the attention mechanism used in Transformer architecture", then we have the key elements: words and tokens, attention weights, highlighted tokens, etc. Let's take a look at that one to see what it actually looks like. What was the name of the function that I had up here that displayed the image? display_base64_image. So let's use it right here and print images number three. Here we have image number three. Let's check number one: "illustrates key concepts of the Transformer architecture". I'm trying to find an easier one to visualize. Let's see which one this was. It was image zero. So image zero is of course still the one that we saw before, and if we check the summary of that one, we can see that the overall structure is a diagram structured into two main sections: we have the encoder and the decoder, then some arrows and connections. So this is exactly what we want to embed. These are the summaries that
we're going to be vectorizing and adding to our vector database. I'm going to remove these two samples right here just to make it easier, but there we go. So that was creating the summaries of all of our elements. As you can see, we already have the three arrays with the images, the texts, and the tables, and then we have three other arrays with the corresponding summaries: for the texts, for the tables, and for the images. Now, as we saw in the diagram before, we have to link them together using an ID, and that's what we're going to do right now. Then we're going to load them into our vector database and our document store.

Okay, so now it's time to talk about how to load those summaries and elements into our vector store and our document store. This is actually very simple: we're going to use a LangChain abstraction called MultiVectorRetriever. It's pretty straightforward, and what we're doing with it is what we saw before: we create an ID for every single document and add it as metadata to both the summary and the document. The document goes to the vector store, sorry, the document goes to the document store, which is right here, and the summary goes to the vector store. It is the summary that we retrieve using semantic similarity search, and once we have retrieved the summary, we check the document ID that is in its metadata and go fetch the corresponding document in the document store that has the same ID. That's essentially all that we're doing, and this is what the MultiVectorRetriever does in LangChain. Of course, you can code this yourself if you want; you're not forced to use MultiVectorRetriever. I
just feel like this is a good level of abstraction to stop at, because I feel like what's going on under the hood is pretty self-explanatory. What this one does is very simple: you pass it the vector store, you pass it the document store, and you pass it the ID, I mean the key that it's going to add to the metadata to connect both of them. That's essentially all that we're doing. We're initializing a Chroma vector store, then a document store, in memory in this case, and a metadata ID key, which is going to be the document ID, like this one right here. Then we're creating this abstraction that helps us link them together. This retriever returns to us the documents that are relevant; it's not even going to return the summaries, only the documents. So let's execute this right here. Now we have created it, and we can actually start loading our documents. This is actually empty for now.

Now let's load every single thing that we want to load. The first thing that we want to load is the documents, and just as I showed you before, we have to create IDs for each one of them. Sorry, the first thing that we're going to load is the texts, which are the composite elements. First we have to create an ID for each one of them, so this array right here creates a UUID for every single element in texts, and then we add that ID to the metadata of every single document that we generate. This is essentially a one-liner that creates a LangChain Document for every single composite element that was returned to us from unstructured, which is
in the texts variable. So this is creating the summary text documents. Then we load those documents into our vector store, and the actual text, the actual composite element that we extracted with unstructured, goes to the document store. This one is the one that is going to be retrieved, not the summary. The summary is only used for finding it; what we actually get from the retriever is this one right here. We do exactly the same thing with tables: we create a LangChain Document. In case you don't remember, I imported Document from here, from langchain.schema. Actually, I think this is wrong, that is old school. I think now it's from langchain_core.documents that we import Document. Yeah, wow, this is old code. All right, let me just fix this. From langchain_openai we import OpenAIEmbeddings. Yeah, I don't know what this is, old code. From langchain_core.retrievers we import MultiVectorRetriever. There we go. Is this working correctly? MultiVectorRetriever... Actually, sorry, this is the only one that actually comes from langchain.retrievers.multi_vector. So there we go.

Then we add every single thing. We do exactly the same thing for the images: generating an ID for every single image, creating a document for every single summary, and then adding the images themselves to the document store. It is the image itself, in base64, that is going to be retrieved. It takes a few seconds to load everything, and now we have everything within our retriever, our document store, which has a vector store assigned to it, and we can actually start testing it. So now I ask "what is multi-head attention" on this retriever that I have right here. Remember that my retriever is a MultiVectorRetriever. So now I can essentially just
execute this, and the chunks are going to be right here. Let's see. There we go. The first chunk is a composite element; the second chunk is actually a base64 thing, which has to be an image, because that is what we added to our document store. Here we have another composite element, and another composite element. So, pretty convenient.

Now, right here, this is some extra code; you don't necessarily need it. This is code that was available on one of LangChain's documentation pages; I'll add a link to that one in the description. It essentially renders the page and highlights whatever elements you send to it. I just had to update it a little bit. Let me create the function and show you what each one of those chunks that we just retrieved has inside of it. So these are our four chunks, and remember that the composite elements actually have more elements within them. This first one that we're going to tap into is a composite element; remember that it has a lot of things within it, so let's take a look at what it has inside. For every single composite element, first we print its number, and then we write down what type of element each one is and which page it is on, just to see. You can see that the first chunk that it retrieved has a title, narrative text, and all the way to a list item, and it spans from page four to page five. Pretty good. Here we have chunk number two. Let me just run this as a scrollable object. Here we have chunk number two: you can see it has a title, a narrative text, the footer, all the way to a narrative text, and it spans from page two all the way to page four. Pretty good. Same for chunk three. All right, pretty good. Now let's take a look at the first chunk right here. I think I forgot
to do something right here. chunks, metadata... oh yeah, because this one right here is actually the image. Let's open the first one. All right, so the first chunk, and the first original element from the chunk, is a title, just like we saw before, and it's on page number four. There we go, it came. Right here I just coded a couple of quick functions. I don't want to confuse you with these functions; essentially all they do is extract which pages a given chunk covers, because remember that a chunk in itself contains a bunch of elements. As we saw right here, the first one contains the title from page four all the way to a list item on page five, and chunk number two goes all the way from the title on page number three to a narrative text on page number four. And here I just coded two very quick functions that display the pages of whichever chunks you pass to them. So here I'm going to pass the first chunk and... wait, what? This one was actually not useful. All right, so there we go: the fourth chunk contains an introduction, background, etc., so it's only one page, this one right here.

Let's check chunk number two. It is supposed to span from page number three all the way to page number four, so let's check that one right here. Which one was it? Chunk number two. So let's see that one. It spans from here, from "Attention", all the way to page number four, and these are the elements that were retrieved. As you can see, this chunk in itself is pretty self-contained, and it contains all of the information that we need that is related to this particular topic. That is why it is so useful and so important to use chunking by title in this kind of document. When you're using unstructured it becomes super easy, because the entire chunk is interconnected. It's not
like it split the text randomly and stopped chunking here, and then the next chunk starts here. It literally is by title, which is very, very convenient. And it actually has the images here. Now, in this particular example I am not extracting the images from the chunks themselves, because as you may have noticed, so far I am embedding the images separately, but you could just extract the images from here and that would work pretty well.

So now that we have that, we can actually start creating the RAG pipeline. We have all the elements so far; we have the retriever, which is working. The retriever is multimodal. Actually, let me show you that it is multimodal. Where was the function that we had that could display images? This one right here. So this query retrieved four documents, and as you can see, the second document is a base64 document, so let's take a look at it. We check chunks, and we check the second one right here; this taps into this element right here. Let's execute. There we go: we have retrieved the image, we have retrieved all of the documents themselves, and now we can start creating our RAG pipeline. So let's get to that.

Now, to create our RAG pipeline, essentially all that we're doing here is using a couple of helper functions. First of all, we import RunnablePassthrough and RunnableLambda, which we're going to be using. And remember that this chain is supposed to give you an answer based on the retrieved documents. Since some of the documents that will be retrieved are images, the chain has to include a language model that has multimodal input, which is the case with GPT-4o mini right here, and that's the model that I am using for this one right here.
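
Because the retriever returns composite text elements mixed with base64-encoded images, the chain needs some way to tell the two apart before building a prompt. What follows is a minimal sketch of one way to do that with only the standard library; it is not the notebook's code. The helper names `looks_like_base64_image` and `split_images_and_texts` are hypothetical, and the check assumes the images are JPEG or PNG, which the video does not state.

```python
import base64
import binascii

# Assumption: extracted images are JPEG or PNG, so we check their magic bytes.
_IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def looks_like_base64_image(payload: str) -> bool:
    """Return True if payload is valid base64 that decodes to JPEG/PNG bytes."""
    try:
        raw = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64 (for example, ordinary prose containing spaces).
        return False
    return raw.startswith(_IMAGE_SIGNATURES)


def split_images_and_texts(chunks):
    """Separate retrieved chunks into base64 images and everything else."""
    images, texts = [], []
    for chunk in chunks:
        if isinstance(chunk, str) and looks_like_base64_image(chunk):
            images.append(chunk)
        else:
            texts.append(chunk)
    return {"images": images, "texts": texts}
```

Checking the decoded bytes, rather than only whether decoding succeeds, avoids misclassifying short strings such as "text" that happen to be valid base64.
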
Okay, so I created a couple of chains right here. The first one is the simpler one, and
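
Earlier in the transcript, the message format for a multimodal model is described as a user message holding a dictionary of type "text" plus a dictionary of type "image_url" carrying the image in base64. The sketch below assembles that kind of content list in plain Python. The function name `build_message_content`, the prompt wording, and the `image/jpeg` MIME type are assumptions for illustration, not taken from the notebook.

```python
def build_message_content(question, texts, images_b64):
    """Build the content list of a single user message: text first, then images."""
    context_text = "\n\n".join(texts)
    content = [
        {
            "type": "text",
            "text": (
                "Answer the question based only on the following context, "
                "which can include text, tables, and the images below.\n"
                f"Context:\n{context_text}\n\n"
                f"Question: {question}"
            ),
        }
    ]
    for image in images_b64:
        # Assumption: images are JPEG; each one is sent as a base64 data URL.
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
        )
    return content
```

The resulting list would then be passed as the content of a user message to a model with vision; how that call is made depends on the client library in use.
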

Original Description

This tutorial video guides you through building a multimodal Retrieval-Augmented Generation (RAG) pipeline using LangChain and the Unstructured library. You'll learn how to create an AI-powered system that can query complex documents, such as PDFs containing text, images, tables, and plots, by harnessing the multimodal capabilities of advanced large language models (LLMs) like GPT-4 with vision. We begin by setting up the Unstructured library to parse and pre-process various document formats, from images to text. Then, we use LangChain to establish a document retrieval system that integrates textual and visual data into a multimodal LLM, enabling comprehensive understanding and accurate, relevant responses. This method is perfect for tasks requiring insights across multiple data formats, such as technical documents, scientific papers, and presentations. Whether you're a beginner in multimodal pipelines or looking to improve your RAG workflows, this step-by-step guide will help you create an intelligent document querying system that goes beyond text, broadening the scope for real-world applications. Don't miss this opportunity to make document intelligence genuinely multimodal!

Topics
1. How can you set up the Unstructured library to parse and pre-process diverse document types?
2. Want to learn how to create a document retrieval system that utilizes both textual and visual data?
3. Discover how to integrate multimodal data into a LangChain-powered Retrieval-Augmented Generation pipeline!
4. Uncover the benefits of using a multimodal LLM for more comprehensive understanding and accurate responses.
5. Create an AI-powered document querying system that goes beyond text, expanding the possibilities for real-world applications.

Links
🚀 Zero-to-hero AI Engineer Bootcamp: https://www.aibootcamp.dev/
👉 Code on this video: https://colab.research.google.com/gist/alejandro-ao/47db0b8b9d00b10a96ab42dd59d90b86/langchain-multimodal.ipynb
📽️ Introduction to RAG:

This video tutorial teaches how to build a multimodal RAG pipeline using LangChain and the Unstructured library, enabling AI-powered systems to query complex documents. The tutorial covers the use of various tools and techniques to create a sophisticated multi-vector store and extract structured data from unstructured documents. By following this tutorial, viewers can gain hands-on experience with building a multimodal RAG pipeline and learn how to apply it to real-world applications.

Key Takeaways

Install required libraries and tools
Partition PDFs and extract images and tables
Create a multi-vector store using ChromaDB
Embed summaries using an embeddings model
Query the vector database for relevant documents
Retrieve documents from the document store using doc ID metadata
Use a multimodal language model to describe images
Use a language model to summarize tables and text
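
The steps above hinge on one idea: summaries are what get embedded and searched, and a shared doc ID leads from each summary back to the original element. The toy sketch below shows that linkage in plain Python. `ToyMultiVectorRetriever` is a hypothetical stand-in rather than LangChain's MultiVectorRetriever, and the bag-of-words `embed` function is an assumption that replaces a real embeddings model and vector database.

```python
import math
import uuid
from collections import Counter

ID_KEY = "doc_id"  # metadata key linking a summary to its original element


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embeddings model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMultiVectorRetriever:
    def __init__(self):
        self.vectorstore = []  # (summary embedding, metadata) pairs
        self.docstore = {}     # doc_id -> original element

    def add(self, originals, summaries):
        """Store each summary for search and each original under the same ID."""
        for original, summary in zip(originals, summaries):
            doc_id = str(uuid.uuid4())
            self.vectorstore.append((embed(summary), {ID_KEY: doc_id}))
            self.docstore[doc_id] = original

    def invoke(self, query, k=1):
        """Search the summaries, then return the linked originals."""
        query_vec = embed(query)
        ranked = sorted(
            self.vectorstore,
            key=lambda item: cosine(query_vec, item[0]),
            reverse=True,
        )
        return [self.docstore[meta[ID_KEY]] for _, meta in ranked[:k]]
```

A call such as `retriever.invoke("transformer encoder diagram")` returns the stored original (for an image, its base64 string), never the summary that matched the query.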

💡 The key insight of this tutorial is that multimodal RAG can be used to create sophisticated AI-powered systems that can query complex documents, enabling a wide range of applications such as document retrieval, semantic search, and data summarization.
