LangChain x Pinecone: Supercharging Llama-2 with RAG

LangChain · Intermediate ·🔍 RAG & Vector Search ·2y ago

Skills: RAG Basics90%Vector Stores80%RAG Evaluation70%Advanced RAG60%

Key Takeaways

LangChain and Pinecone are working together to supercharge Llama-2 with Retrieval Augmented Generation (RAG), allowing the model to access an external knowledge base through a vector database to improve its performance and relevance. The solution utilizes a parametric knowledge solver and semantic search to retrieve relevant information with natural language.

Full Transcript

below everyone we're going live with an awesome uh awesome topic awesome speakers today um but before we jump into it minor logistic things this is being recorded um so it will be available at the link after the fact and then we'll also put it up on YouTube later in the week um if you guys have questions during the during during the webinar put them not in the normal chat box but there's a little box uh for questions and answers it's the it's on the side it's got the one with the question mark in it if you put them there and then and then upload the ones that you like best um and will basically answer those in terms of schedule we'll do is we'll we'll go uh we'll do quick and little intros after this then we'll jump right into it with with uh James taking it over for a presentation then Lance and then back to James and then open it up for audience QA so that's what we'll get that's when we'll get to all the uh question answer stuff um in in the Box um that's pretty much it for Logistics pretty simple pretty straightforward maybe we can do quick intros my name is Harrison uh I work at Lane chain so trying to make it as easy as possible to build online applications uh Lance do you want to do a quick intro I'm Lance I'm also on the line chain team and have been doing most recently a bit of work with llama so looking forward to discussing uh more today and yeah and I'm James I'm at Pine Cone so uh do like more events today base and recently been playing around with llama too as well so pretty decides to share um what we've been working on awesome all right James you want to take it away second um goes figuring out to share my screen on here it is at the bottom next to the new audio turn off camera okay so it should be this can you see that okay yes it's a little small it's kind of assumed out yeah um maybe if I just go I mean I can do this otherwise it seems to get small yeah I think uh I think just the PowerPoint slides are fine cool all right that's good um so yeah we're gonna talk about retrieval augmented generation with the new llama 2 model so the problem that we're trying to solve here is LMS they work very well we've all seen that but in terms of what they do know it's limited to what they've learned during training right so uh llama to the training cutoff is I think for some of the fine-tuning like fine-tune chat model in June which is pretty recent but that means if you're asking questions about for example line chain which is a very fast-moving you know changing Library there's a good chance at the information that the LM nodes is going to be very outdated dated which is kind of you know a bit of a problem when you're working or wanting up-to-date information another issue is you know the data that LMS like longitude been trained on is not necessarily going to cover the data that you would like to ask questions about so if you're working in an organization you might have like internal organization documents that you would like to be considered by your llm is no it's not going to work unfortunately so this is an example where this was the first version of gpt4 and I asked it how do I use Airline chain line chain and it just told me about this blockchain based decentralized AI language model which is not the line chain I was trying to ask about and I'm also not sure about this information being correct anyway so yeah this is the issue that we have when lens don't have access to the external world so the solution is to give them access to that external world um how we actually do that there's different ways so let me come down to here what we have at the moment is this we have a lam the knowledge solvent within it is called parametric knowledge that's everything from the train data it's kind of Frozen in time all we want is this so we actually want to connect what we call a knowledge base and more specifically in this case a Vex database to our LM so it's going to have both like the um the processing power of the brain that an llm has but we also have this database which allows us to actually manage the information that Elm knows okay so it's like a traditional database right you want to be able to add the update information but you want to connect that to your llm so how do we feed external knowledge into the llm so one thing like like the first approach that most people are going to take is okay we're going to put everything into the context window um which is okay it it depends on how much data you have basically so GT4 if you get access to the 32k model has a maximum contact so under 32 000 tokens uh the the cloud model goes up to a hundred thousand tokens which is pretty big let's sell just 75 000 words um or about 150 pages of text now if you're just wanting to kind of like uh chat with a single document it's fine it'll work but if you're for example chatting with your company documents then all of a sudden you probably have more than 150 pages of company documents within your organization so things start to get a little a lot more difficult because you exceed that maximum limit of the context you're in there and even if we could fit everything into that context window there's a lot of recent research that tells us that that's not actually a good idea so this um this paper from Stanford called Lost In The Middle explored the idea of okay what what is the performance of the model uh when we ask you about information that we put into the context window and how does that performance vary over time or sorry not over time um over the amount of text that we put into that context window and they found that the more stuff you put into your content so in the less performance it is and they have another graph not this one I'm showing here but another graph where you have basically like a a u-shape and that is showing that the llm will pay attention to things that are at the start of your context and at the end of the context but basically ignores everything in the middle um particularly with longer context models so something you need to be wary of and something that we kind of want to avoid we don't want to stuff everything into contest when they want to be more efficient with what we're putting in there so is there a better way ideally we need to very selectively feed just the high relevant bits of information into our contacts to avoid this context-sipping issue um lm's work in natural language so it would be great if our search could do that too and using a vector database we kind of satisfy both of those requirements so we can retrieve relevant dots with natural language and that is something that we would call semantic search so what is semantic surge a sort of semantic meaning so what is the semantic meaning of a word so it moves first through examples here at the web Bank um but it's like in a traditional keyword search all of those would mean the same thing with when you consider semantics they do not consider the same uh mean the same thing because they have a different context they're being used in a different way the actual meaning of those words is different and we as humans we understand that and top the bottom two sentences they don't share any any keywords but in meaning they are basically the same thing so what we want to do with semantic meaning is we want to put things that in a human way have a similar meaning in a similar space and things that in a human way have a different meaning in a different space so that's where semantic search comes in semantic search works using language models but not generative language models um so rather than generating text these language models actually generate embeddings so you can think of them as essentially translating from Human readable text like this query where is Normandy into machine readable vectors okay and within that space relevant chunks of text will be translated into very closely related vectors within that Vector space and that's our Vector database essentially works so you have all of these vectors within your vet space we saw them you can manage them and what uh database like Pancham will do is allow you to enter a new query Vector which is going to be your query translated into a vector and very quickly search that space so you can you can do that over like billions of items right so we don't have um the issue with only being able to put in like 150 pages of text James I got a quick question for you and this is jumping a little bit ahead but what do you think we're talking about um or we will be talking about is like using kind of like llama too and like one of the benefits of that which is like open AI is is basically it's uh you know it's an open source model you can run it locally you could there's a it's you know there's none of kind of like the Privacy concern stuff like that and so like just as we use that for text generation there's open source yeah embedding models as well um yeah so have you noticed like do you have any tips and tricks on like is there any difference between using kind of like open source embeddings with Pinecone or open AI embeddings both in terms of like I don't know I know the different sizes right so they have different size I know you guys have some different search methods do one work to do some work better for open Ai and others for other ones or open AI is the best out there like how should people be thinking about a lot of these like open source embedding models yeah so I'd say okay I'm betting is probably one of the more performance ones out there but a lot of case you don't actually need a lot of performance um so one of the in fact later on in demo one of or the embedding model that we use is an open source model and it's actually a very small model like it's you can you can put you can run it on CPU and it's it'll be fine um so yeah it it varies between the two but with that tiny model we actually get performance that's good enough for our use case um so you don't always need something like opening eyes um too which is a very good embedding model but not 100 needed uh the other thing that you know you can also consider with this is opening eyes embedding model is like uh 1500 dimension this other one that we're going to demo is only like 380. so that means the amount of storage space you get in Pinecone is like uh four times higher maybe even more um so this sort of pros and cons to both approaches um just kind of depends on what you what you want to do it's also to fine-tuning aspects uh these open source models are actually pretty easy to fine-tune if you have your own your own data cool so yeah um so going back to the retrieval augmented generation piece um what we just explained is can these two components here so the we've got the query goes to the embedding model and it goes to Pinecone and it retrieves relevant contents then what we want to do is feed those relevant contacts and the initial query back into our llm so we're we're getting that external knowledge from Pinecone and being into a lamp so it is now kind of connected to the world or at least connected to the part of the world that we would like it to be connected to so pros of this approach we like we're getting going to get retrieval highly relevant dot using natural language search or semantic search you can scale that to billions of Records we get data management like a traditional database and we don't need to do any context stuffing so we don't get that performance uh degradation that we saw with the other examples so yeah that's it on the on the retrieval piece um I'll hand it over to Once yup perfect segue and you know building on all that blank chain is an application development framework that makes it very easy to build kind of retrieval augmented generation applications and pipelines um it's a little bit hard to see but on the left you can see kind of a sense for the different data connectors for Lang chain there's 135 and I actually link to our integration sub at the bottom of these slides um but these allow you to connect data from all sorts of different places from structured sources unstructured sources public and private James talked a lot about the importance of of kind of operating on private or company data and of course Lane chain has connectors for um really most of the types of data that you'd want to be working with the second thing langtune provides is different embedding models so James talked about for example open eye embeddings being over a thousand Dimension versus some open source models I mentioned gnomec here they have gpd4all embeddings that came out pretty recently uh hugging face and other models and so we have over 20 over 20 Integrations with different embedding models of course we've integrated with any Vector stores Pinecone being one of the primary Integrations we have and kind of finely and relevant to our talk today with many llm Integrations and in particular we have Integrations with the Llama CPP Library which is basically a framework along with python bindings to run llama models um that works with llama V2 and um I provide some documentation at the bottom which kind of shows how this can be run in particular with Lama V2 but I can go to the next slide James I have a quick question here while we're talking about kind of like just different model providers um there's also like uh you know hosting services like replicate that that kind of like or replicate I don't actually know how to pronounce it that offer um uh hosting of like llama2 how should people think about using llama CPP versus like replica or what are the pros and cons and yeah this is a good point so to be honest running llama V2 locally is a little bit tricky and for example I can run on my laptop but I have an M2 Mac um uh Max 32 gigs I can run about 30 or 25 to 30 tokens a second but many people don't have a computer that for example has that kind of CPU I'm actually also running on GPU take a little bit of time to set that up so basically if you're an organization you have a lot of resources if you're an individual you have a higher performance laptop you really care about privacy running these things locally is a great option but alternatively you know if you don't want to if you don't have those resources or you want something that's faster using an endpoint like replicate is is very easy and we also have integration there um so I think it's a trade-off between how much do you value privacy and the ability to run it locally for example on your machine um versus ease of use and these endpoints I replicate are quite easy to use it's just like hitting open AI or any other kind of external endpoint but that's kind of important thing to highlight this is actually I just added a few figures from the paper maybe the high level takeaway is that long V2 is about as good as GPD 3.5 or chat gbt for language and math but it does lack a bit quite a bit on coding and this is like the Highlight in the paper so that's kind of a way to think about it it's kind of like roughly as good as Charity BT except if you want to do coding tasks then it it draw it's quite a drawback relative to other open source models it's quite strong and then you can see relative to more performant closed Source models of gbd4 it does still lag um and relative to the first llama it's notable that the context length has doubled so that that's the figure at the top contact length went from around 2K to now 4K tokens and it's trained on around 2x more data so that kind of gives you the landscape for the Llama model itself one uh one interesting thing here I actually haven't seen this figure before but um when openai had their uh like their code model um whatever it was called uh code or DaVinci Code whatever um I think a lot of people are like using that over text DaVinci for some of the more like agent-like and like reasoning things because there's a lot of like like code has a lot of like nice properties where there's pattern recognition and structure and stuff like that and so I mean this is again jumping the head a little bit but I think we I've at least tried out llama two for agents and it hasn't been amazing and so I wonder if like that like that Gap actually partially explains some of that like even though it's good at like writing and you think like agents are about writing things like a big part of it is pattern recognition and kind of like things related to code and so maybe that's actually a good uh yeah maybe like the coding Challengers are are good ones to look at for like some indication of how good they'll be at like reasoning tasks or something like that yeah that's an interesting point and actually James I I think you played a little bit as well with llama two and agents maybe we could speak about that later but um that is that actually may be a reasonable hypothesis as to why it may lag a bit with agents um you can see his coding capacity is is quite a bit worse even than GPD 3.5 um and of course ubd4 kind of is state of the art across the board which is kind of expected um but it's also notable one other thing I'll flag here is that the context like this is reasonably good now of course it lags what you see with larger open source models but 4K tokens you know time what is it four and a half characters per token that you're able to fit quite a bit of context in there for retrieval augmented generation assuming say a chunk size of 1500 characters you know that's quite a few chunks so um it is pretty reasonable for retrieval augmented generation in that respect and considering has a language capacities of gpt35 there's a reasonable hypothesis that it is a quite promising model for retrieval augmented generation um and maybe James you can go to the next slide um you might actually have to play this video um yeah this is more of just like a a fun one but this is showing this is running locally my laptop uh on my Mac M2 um on GPU I can get like uh around you know 25 tokens a second so not bad it shows the fact that it is pretty cool you can run these models yourself locally on consumer grade um uh hardware and this is provided in the notebook um this is an example of retrieval major generation with streaming in line chain with uh the llama2 model um so maybe this is it sets the stage maybe to dive in a little bit more to James's notebook in particular and then we can kind of just keep kind of discussing and move on to questions yeah definitely so let's move on to the demo um so you should be able to see um so we'll just go through really quickly the yeah I can I can talk through these these first components as well you know because it's already embedding component comes in so as we kind of mentioned we're actually using this open source model so it's called a sentence Transformer and we use this one here so it's record it's literally called mini language model uh it's very small super easy to run and we load that in through hiking face but the uh the line chain Library so that kind of just wraps everything up so we don't need to worry about actually creating those embeddings with a few lines of Pi torch code so we wrap it up within the home face embeddings class um and then this is just an example of how we actually create those so we have these two like documents which is like 200 text and we just embed those like this and we can see that from that we get two document embeddings and each of those has a dimensionality of 384. so you know a lot smaller than the open AI embeddings which are you know 1 500 and something um it's much bigger so once we've created our embedding model next thing we're going to do is create our Vector index that's affect the database uh with that we need to get a pine cone API key and then we just put those details in there and we'd create our index and this is probably at least on the Pinecone side of things this is the only point where it's going to vary between um how you would use like an open source model versus like open AI models so index name it's the same there's no difference there but this is a different bit so the dimensionality and also possibly the metrics so the dimensionality is rather than the 1500 that we use open now it's like a 384 um or so that we have this model and then also the metric so with opening embeddings we're going to either use cosine or dot product um with this model we have to use cosine we can't use dot products as if I'm correct and there are also some models out there that you would have to use euclidean so you just have to be careful when you have to usually it says in the model card if you're looking at like hooking face models um you can usually see in the model card what you need to use there so that's the only actual difference you need to make to your code in order to get this working Okay so I quickly scraped a few archive papers I relate to llama to um including the alarm 2 research paper and I just pulled those in I embedded them so we have that embed documents here and stored them all within python so at the end of that we can see we have it's pretty small like I said we can have billions of documents in here in this case we just have like four thousand uh just under five thousand um Okay cool so moving on we now have that like everything embedded we have all our documents sold in pinecan now what we'd want to do is like hookers out to a llama 2 model so I'm using the 13B Lam 2 model here um we're using quantization so we can fit this on to the statute of free version of collab so we've got the T4 GPU up here so you can convert this onto there which is I think is pretty cool um to do that you do need access so you have to like sign up for Access with meta and then you need to pass in your homophones authentication token um but yeah we load that model we load the tokenizer for that model and then we create a texture Innovation pipeline okay and with that we can see we can generate our text but right now everything we just said is within the hug and face Library which doesn't have all of like the the processing pipelines or the like agents or chains that line chain has so what we now need to do is take that and basically just insert it into the line chain Library which uh obviously like Jade has an integration for that so we have the quick and face pipeline um and we just feed in our tattoo and ocean pipeline there and now when we run it we're actually running it through line chain which is great because now we can access all the other components um of line chain including the retrievable QA chain which is basically like a super easy way for us to do retrieve augmented generation um so there's two methods that we could do this with we have to retrieval QA or retriever QA with sources chain the with sources chain would just allow you to um basically return where you're getting the information from which can be pretty useful in in a lot of cases um so when we're saying that up we would load our python index again through line chain and we can just confirm it works so if I say I'll mix alarm is too special we're going to get a load of like documents returned from like various points in the in the database now looking at these are kind of hard to read uh but fortunately the model actually manages a lot better than I do with that so we just move on to the uh creating that retrieve augmented generation pipeline we include our LM we include the retriever which is just the vector cell um and yeah we move on to actually asking some questions so the first thing I want to do here is just try and ask um the lens I'm going to ask longitude about itself without any retrieval augmentation and here we go okay long two is unique and special animal for several reasons um yeah alarms are known for the size and other things um yeah they're silky to the touch you know it's talking about their online uh which is is fine but basically llama 2 doesn't know about itself because it's it's too recent so what if we try it with our road pipeline um so we get okay alarm two is a collection of retrain of fine tunes uh large language models so yes it knows what we're talking about this time um optimized for dialogue use cases and out form other open source chat models on the most benchmarks of tester which is you know one of the points I do make it special so that's cool now let's try some more again let's let's just uh torture the alarm 2 model without retrieval augmentation a little more and that's about safety measures um and I don't know what it's talking about here but it's definitely not what we want so let's skip that and try it with retrieval augmentation okay so now we do get relevant information so development and longitude safety measures pre-training fine tuning and model safety approaches and they also delayed the release of the 34 billion parameter model because they didn't have time to Red Team okay so that that's again pretty cool answer um but I want to know a little more so you know what what are these red teaming procedures um that they use belong to and then we get this answer it was just kind of explaining um what those actually are and finally that's one more question how does it perform uh as alarm to other local LMS and we see that the we'll get this so long to platforms other models on serious helpfulness and safety benchmarks they've tested and it appears to be on source with some of the closed Source models and so that's the the Jeep G 3.5 results that we saw earlier so yeah that's our little example we can clearly see that llm performance by itself on like this this information it's not that great as soon as we add in that pipeline we add in retrieve augmentation performance goes up quite a lot and we actually get relevant results and this is just using out of the box like prompts and everything through Lang chain we haven't even you know didn't even need to modify any process so it's pretty cool that it just works um okay great so let me switch back to the presentation and yeah you can I think that that's actually downloads it for the slides awesome one thing that I want to ask and this will probably kick off a larger discussion but like towards the end you said you know it works with the default prompts so I I kind of got like two questions for both both plants and James which is like um one like is this the four first like open source model that kind of just like works with these default prompts have you noticed other ones how does it compare James I know you did great kind of like video on on Falcon as well maybe we can just start with that one yeah like how like you know a lot of the prompts in link chain are optimized for kind of like open AI models um just because that's what people like llama 2 seems to work reasonably well with most of them like is this the first open source model to do that or are there others that do that as well I'd say it's not not the first like at least in this use case where we're doing like just retrievable QA our retriever QA is a lot simpler of a task um than say for example the using agents right if you if you try and get llama like I I try to get along Ascent to be working as a conversation agent and eventually after like a lot of you know like prompt engineering like getting the output passes right like it did work fairly fairly well but that's like a 70 billion parameter model as soon as I try with like the 13 billion parameter model like it it does through a lot more um whereas yeah like falcon 40b I I think it also you can get it working just about uh in in that instance so it depends on the complexity of the test with retrieval QA um I think there are other open source models that can work with this as well foreign yeah I can build on that a little bit and in the chat I'm going to share something that I uh kind of tweeted out a while back and James I'd be curious to your thoughts on this as well looking through some of the Facebook code uh I kind of identified that appears llama will recognize a particular tokens for example system versus instruction uh tokens that can be included in your prompt and I have tried that only empirically I think it is a bit better I haven't systematically evaluated I'm curious if you played with that at all or have you observed anything there I think I saw you reference that on Twitter is that something that you've also observed yeah yeah so with those tokens it does work a lot better um I think I think I was I was kind of like going around in circles trying to get it to work for a long time and not realizing that these tokens are right thing uh then I kind of like stumbled upon them yeah like added them in and then that was when it actually started working um and one other thing that was kind of interesting and I I kind of I'm not sure if this is the correct entirely the correct way of using them or not um but they did mention the paper that over multiple interactions um as a like as an agent it seems to forget the original instructions and I found that as well right so as a conversational agent over multiple interactions uh it would forget to Output the Json format and then it would just kind of start chatting instead so uh what I added in and what they mentioned in the paper is that if they insert some the instructions into the user query um then it will continue over more interactions so that's what I started doing and so I used these instruction tokens on either side of the instructions I insert into the user query and after doing that it was just like a brief reminder it wasn't like the full-on instructions that included within the system message it was just something like remember to Output in Json format with action and action input keys and just adding that little sentence like improved the performance a lot then it was like every time almost every time it was like a perfect like response for an agent which is pretty cool got it and in the notebook we just showed I believe you're just using the default retrieval QA prompt which as you showed does basically work but I guess what you're saying is with these Special tokens you have further enhanced uh performance yeah I think the idea is um sorry I use not with retrievable QA I didn't use these special tokens um I should also add that these special tokens are specific to the chats the fine-tuned chat version as far as I understand um at least see the instructions um so there's also that but yeah they within the retrieval QA I haven't used them before so I don't know but I'm not sure how the performance would vary there yep I'm adding something to the chat now as well actually today we just um merge a new a web research Retriever and in the code there we actually toggle between different prompts depending on the model choice so it's a nice reference as to how you can select different prompts within Lang chain using conditional prompt selector and it can kind of automatically detect the model and choose the corresponding prompt so that's like a nice trick that can be used particularly when you're like toggling between llama and for example open AI or other providers um it's just a nice thing to keep in mind to the point around the chat model as well one thing which I actually just tried out and it didn't work as well as I thought but like you know a lot of the chat models are very verbose and like want to respond with chats and stuff and for a lot of the agent stuff kind of as you were saying James you just want a structured response and you want to do like some pattern recognition stuff so I tried out one of the just like base llama models um with some of the old school prompts for for agents that were more about pattern recognition with the hope that it would work well there didn't really work out but I do think there's something kind of like there where like um you know chat models are verbose been kind of not great for agents and maybe the base models can actually be better for some of the instruction following especially for the less powerful models like this but um all right I'm going to jump into some of the questions that we're getting a lot of them are around retrieval which is awesome um the top one is can you speak about how much performance boost you can get with hybrid search retrieval and um yeah maybe we can even Zoom this back out to like you know you you talked a lot about kind of like similarity search what other tips and tricks can we do on top of similarity search whether it be hybrid whether it be other things to improve performance and then how would you recommend people getting started and exploring those yeah so yeah there are there are loads of extra things you can do so kind of what we describe here it's almost like the base version of uh semantic social or perpet search um so hybrid search hybrid search tapes both this what you see here where we have like the what we call dense vectors where you're kind of encoding the semantic meaning into these vectors um but then it also kind of measures it with the more traditional search where you're kind of looking at more like keywords so like uh in this case here with this like the first three where you're looking at bank um and more traditional search a keyword search might actually rate those things as being very closely related whereas a semantic search would not now you know why would you want to go with that traditional search well in some cases that can actually be very useful particularly if you have like more domain specific language like okay if you work in um if you work in like the one that I've seen come up a bit recently is like for crypto uh there's tokens like like ethereum or f um semantic search might kind of relate that very closely with Bitcoin and if someone's asking a question about ethereum they they probably want an answer that's about ethereum so by having that traditional search component you can specify okay I want this you know I want ethereum not Bitcoin right where's my search might struggle with that um but at the same time you probably kind of want not just traditional but almost like a mix of both um so hybrid search is putting those two together so you're you're basically you're doing your semantic search you're also doing your keyword search and then you're merging those result results and kind of like re-ranking based on you know whether you want more traditional whether you are more semantic so if I would say like it do threshold for whether you should consider like hybrid is you know you try dents um and it doesn't really seem to work and particularly if you know it's that kind of keyword thing where like keywords don't seem to have enough importance then that's why you might want to use a hybrid search where you're using both um yeah so hybrid searching help a lot um what are the other other parts of that question Harrison well well this the other big question um or the other top question is also about retrieval and it's basically we've got a lot of documents um we find that queries return um irrelevant documents causing alums hallucinations exactly what you were saying earlier about wanting to pull down the context and then one thing that they're asking for is basically they want to kind of like set a search distance filter to set off at certain similarities and they're asking about whether this is like a pine cone and like and that's works but like more generally if you reframe this as like yeah how like and I guess this very much ties into like things Beyond semantic search it's like it's like maybe like yeah one maybe specifically is there a specific way to put a cut off on the cosine distance within Pinecone or would you recommend doing that outside of Pinecone and then just like second like hacky things or not hacky things but like tips and tricks like this that are that are good to do yeah so I mean if you're returning like a lot of irrelevant information like the first thing I would look at is the embedding process like are you um are you putting too much text into each embedding that would probably be uh like a one of the common issues so you want to you want to try and split your tapes into smaller chunks and then embed uh and then once you've done that if what you can do on the other side so on the actual retrieval side um within Pine counter isn't a feature for this um but the solution is still pretty straightforward uh you you have the top K parameter which is basically how many items or how many documents you're retrieving let's say usually you'd want to retrieve like five documents um what you can do is just retrieve like and then retrieve like 50 maybe 100 documents and then add that cut off and just take like the top five um so you when whenever you retrieve items from from Pine Cone you're going to get a Samaritan score so you can tweak your your threshold for what you want to let through based on that um that would be kind of like the most straightforward solution another solution that you can go for and this is something that I've seen use like fairly effectively is again retrieving more documents from Pinecone and then using a re-ranking model like after you've retrieved your items so like you're here I have a re-ranking model um also through at least through the sentence Transformers model uh you can get re-ranking models as well they're open source so basically what they will do is they'll just look at like these hundred items and they're just going to re-rank them and they are less slower but they're more powerful than a typical embedding model um so that's why we would only use it on like the last 100 items um but they will basically re-rank everything and you'll you'll get usually high quality results from that as well awesome and Lance you've done a lot of stuff with retrieval do you have any kind of like favorite kind of like retrieval tips and tricks I've got one in mind but I'm I'll let you go first and then I'll add my favorite yeah we've actually um I'll share some some Tweets in the in the chat we've we've had a few different recent retrievers um so let's see um I think well actually maybe you go first I'm gonna I'm gonna pull something up so I can share uh all right my favorite one is one of the ones that we call kind of like self query and so it builds on top of the um it builds on top of the vector store and basically a lot of vector stores including Pinecone I think we actually did first for Pinecone support like metadata filtering um and so when you get like a query or a question some of the query might not actually be about the semantic meaning of it um but about like particular filters that you might want to apply so for example like what's a movie about aliens in the Year 1980 or something like that so like the Year 1980 it's not it's not a semantic thing you want to search on it's like the literal year that you want to filter on and so if you have year as kind of like a metadata attribute you can kind of like split that out and so we kind of like use a language model to split out kind of like that filter from and then pass the semantic bit and create a vector and do kind of like the the vector search that way but then also pass in a filter that we extract um and I think this is um yeah this is this is one of my more uh I like this method a lot I think it's pretty good um and and uh yeah that so that that's my probably like favorite hack on top of on top of just straight Vector search yeah I just added something to the chat so one other thing we didn't talk about this too much but basically persisting metadata with every chunk is a very nice trick because metadata gives you kind of a handle when you're doing the retrieval to fill different things and so we have a number of different ways to persist metadata associated with like for if you're working with documents where did each chunk of the document actually come from um like the introduction or what section of the document likewise we have the same thing for code we have something for markdown files and this ability to persist metadata in the retrieval stage is quite nice and actually plays very well into self queer retriever because then in fact I'll share a I'll share some documentation but basically when you have these metadata tags that came from splitting you can use them with a self query retriever very nicely I'll share some documentation there as well um so that is one trick that I that I really like and that's been very popular in the community awesome here's a here's some really good documentation as well about how that plays in with uh sub query retriever and of course Pinecone is very nice support for metadata filtering so it's um it's um you know it really it really works well uh all right maybe the last question to end on kind of like combining um retrieval and uh and the open source nature is basically around like fine-tuning embedding models um like have you guys played around with this have you seen people who are doing that do you have any tips and tricks for people looking to do that I actually haven't done that at all so I'm just looking to learn from you guys at this point yeah um so find something embedding models actually it's actually not that difficult and doesn't require that much compute comparison fine-tuning a lot of other models um so iron and this is this is like kind of going back a little bit now like when I was actually fine-tuning these models I haven't since you know the whole uh open air I think but like before that time um I worked a lot on just like fine-tuning sentence transform models and there's a lot of different methods depending on your like your data set that you can use even even methods where you don't need to like um label your data generally speaking though you should you should label your data basically what you want to do is you want to get pairs of sentences or paragraphs that are similar or dissimilar and you're just going to feed that into the model and tell it which one similar which ones are not similar um and you can usually get pretty good results if you like the sort of rule of thumb again back then things may have changed a bit now the Royal thumb back downwards like if you have 10 000 of these pairs um you can like you don't need any more than that and then a lot of time you can actually deal actually train a model with much less than 10 000 pairs like five thousand was something that I did fairly often um and training times that we like on a consumer grade GPU doing a couple of hours nice one thing I'll throw I just add something to the chat glean's actually a company that's been pretty interesting in kind of Enterprise retrieval they talk a lot about fine tuning embedding models on your data as kind of a significant impact on quality um and it's kind of interesting to recognize um so I I think it's a great area to think about all right well I want to thank you guys for uh for for helping give this webinar I think uh yeah I think retrieval is always one of the most interesting things to talk about and llama 2 kind of took the World by storm last week so glad we were able to combine the two into one awesome um one awesome webinar and I want to thank everyone for tuning in this this will be available on uh YouTube and um yeah looking looking forward to the next one thank you guys cool thanks guys thanks a lot bye foreign

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from LangChain · LangChain · 5 of 60

← Previous Next →

Chat With Your Documents Using LangChain + JavaScript

Chat With Your Documents Using LangChain + JavaScript

LangChain SQL Webinar

LangChain SQL Webinar

LangChain "OpenAI functions" Webinar

LangChain "OpenAI functions" Webinar

LangSmith Launch

LangSmith Launch

LangChain x Pinecone: Supercharging Llama-2 with RAG

LangChain x Pinecone: Supercharging Llama-2 with RAG

LangChain Expression Language

LangChain Expression Language

Building LLM applications with LangChain with Lance

Building LLM applications with LangChain with Lance

Benchmarking Question/Answering Over CSV Data

Benchmarking Question/Answering Over CSV Data

LangChain "RAG Evaluation" Webinar

LangChain "RAG Evaluation" Webinar

Fine-tuning in Your Voice Webinar

Fine-tuning in Your Voice Webinar

Tabular Data Retrieval

Tabular Data Retrieval

Building an LLM Application with Audio by AssemblyAI

Building an LLM Application with Audio by AssemblyAI

Superagent Deepdive Webinar

Superagent Deepdive Webinar

Lessons from Deploying LLMs with LangSmith

Lessons from Deploying LLMs with LangSmith

Shortwave Assistant Deepdive Webinar

Shortwave Assistant Deepdive Webinar

Cognitive Architectures for Language Agents

Cognitive Architectures for Language Agents

Effectively Building with LLMs in the Browser with Jacob

Effectively Building with LLMs in the Browser with Jacob

Data Privacy for LLMs

Data Privacy for LLMs

"Theory of Mind" Webinar with Plastic Labs

"Theory of Mind" Webinar with Plastic Labs

LangChain Templates

LangChain Templates

Using Natural Language to Query Postgres with Jacob

Using Natural Language to Query Postgres with Jacob

Building a Research Assistant from Scratch

Building a Research Assistant from Scratch

Benchmarking RAG over LangChain Docs

Benchmarking RAG over LangChain Docs

Skeleton-of-Thought: Building a New Template from Scratch

Skeleton-of-Thought: Building a New Template from Scratch

Benchmarking Methods for Semi-Structured RAG

Benchmarking Methods for Semi-Structured RAG

LangSmith Highlights: Getting Started

LangSmith Highlights: Getting Started

LangSmith Highlights: Debugging

LangSmith Highlights: Debugging

LangSmith Highlights: Datasets

LangSmith Highlights: Datasets

LangSmith Highlights: Evaluation

LangSmith Highlights: Evaluation

LangSmith Highlights: Human Annotation

LangSmith Highlights: Human Annotation

LangSmith Highlights: Monitoring

LangSmith Highlights: Monitoring

LangSmith Highlights: Hub

LangSmith Highlights: Hub

SQL Research Assistant

SQL Research Assistant

Getting Started with Multi-Modal LLMs

Getting Started with Multi-Modal LLMs

Build a Full Stack RAG App With TypeScript

Build a Full Stack RAG App With TypeScript

Auto-Prompt Builder (with Hosted LangServe)

Auto-Prompt Builder (with Hosted LangServe)

LangChain v0.1.0 Launch: Introduction

LangChain v0.1.0 Launch: Introduction

LangChain v0.1.0 Launch: Observability

LangChain v0.1.0 Launch: Observability

LangChain v0.1.0 Launch: Integrations

LangChain v0.1.0 Launch: Integrations

LangChain v0.1.0 Launch: Composability

LangChain v0.1.0 Launch: Composability

LangChain v0.1.0 Launch: Streaming

LangChain v0.1.0 Launch: Streaming

LangChain v0.1.0 Launch: Output Parsing

LangChain v0.1.0 Launch: Output Parsing

LangChain v0.1.0 Launch: Retrieval

LangChain v0.1.0 Launch: Retrieval

LangChain v0.1.0 Launch: Agents

LangChain v0.1.0 Launch: Agents

Build and Deploy a RAG app with Pinecone Serverless

Build and Deploy a RAG app with Pinecone Serverless

Hosted LangServe + LangChain Templates

Hosted LangServe + LangChain Templates

LangGraph: Intro

LangGraph: Intro

LangGraph: Agent Executor

LangGraph: Agent Executor

LangGraph: Chat Agent Executor

LangGraph: Chat Agent Executor

LangGraph: Human-in-the-Loop

LangGraph: Human-in-the-Loop

LangGraph: Dynamically Returning a Tool Output Directly

LangGraph: Dynamically Returning a Tool Output Directly

LangGraph: Respond in a Specific Format

LangGraph: Respond in a Specific Format

LangGraph: Managing Agent Steps

LangGraph: Managing Agent Steps

LangGraph: Force-Calling a Tool

LangGraph: Force-Calling a Tool

LangGraph: Multi-Agent Workflows

LangGraph: Multi-Agent Workflows

Streaming Events: Introducing a new `stream_events` method

Streaming Events: Introducing a new `stream_events` method

Building a web RAG chatbot: using LangChain, Exa (prev. Metaphor), LangSmith, and Hosted Langserve

Building a web RAG chatbot: using LangChain, Exa (prev. Metaphor), LangSmith, and Hosted Langserve

Open Source RAG with Nomic's New Embedding Model (and ChromaDB and Ollama)

Open Source RAG with Nomic's New Embedding Model (and ChromaDB and Ollama)

LangGraph: Persistence

LangGraph: Persistence

This video teaches how to supercharge Llama-2 with Retrieval Augmented Generation (RAG) using LangChain and Pinecone, allowing the model to access an external knowledge base through a vector database. The solution utilizes a parametric knowledge solver and semantic search to retrieve relevant information with natural language. By following the steps outlined in the video, viewers can improve the performance and relevance of Llama-2.

Key Takeaways

Embed document texts using Sentence Transformer
Create vector index with Pinecone's vector database
Load LLaMA-2 model and tokenizer using Hugging Face's Library
Create text generation pipeline using LLaMA-2 model
Insert instruction tokens into user query
Use conditional prompt selector to select different prompts

💡 Retrieval Augmented Generation (RAG) can significantly improve the performance and relevance of language models like Llama-2 by allowing them to access an external knowledge base through a vector database.

🔒 Pro feature: Ask AI to explain this lesson →

More on: RAG Basics

View skill →

High Performance (Realtime) RAG Chains: From Basic to Advanced

High Performance (Realtime) RAG Chains: From Basic to Advanced

Coding the Ultimate RAG Engine from Zero

Coding the Ultimate RAG Engine from Zero

Building Agentic RAG From Scratch in Pure Python

Building Agentic RAG From Scratch in Pure Python

Build an LLM and RAG-based Chat Application using AlloyDB and LangChain

I Built a RAG App to Decode Airline Bureaucracy (So You Don't Have To)

I Built a RAG App to Decode Airline Bureaucracy (So You Don't Have To)

Akamai Developers

RAG Demo for Beginners: Full Hands-On Tutorial in Tamil | Build Your Own RAG AI | Karthik's Show

RAG Demo for Beginners: Full Hands-On Tutorial in Tamil | Build Your Own RAG AI | Karthik's Show

Related AI Lessons

Understanding the Limits of Linear RAG — and Why Agentic Workflows Are Catching On

Learn the limitations of linear RAG pipelines and how agentic workflows are becoming a popular alternative for more efficient and effective AI workflows

Understanding the Limits of Linear RAG — and Why Agentic Workflows Are Catching On

Learn why linear RAG pipelines have limitations and how Agentic workflows are becoming a preferred alternative in the industry

Medium · Machine Learning

Why you shouldn’t search your documents directly with AI

Learn why directly searching documents with AI can be inefficient and how retrieval-augmented systems can improve the process

Medium · Programming

Your AI Keeps Making Things Up. RAG Is How You Make It Use Real Facts Instead.

Learn how to use RAG to make your AI provide accurate answers based on real facts instead of making things up

RRF vs DBSF with Qdrant: Hybrid Retrieval Fusion for RAG in Python

Professor Py: AI Engineering