Does AgenticRAG Really Work?

MLOps.community · Advanced ·🧠 Large Language Models ·6mo ago


Key Takeaways

The video discusses how well AgenticRAG works in practice: an approach that pairs retrieval-augmented generation (RAG) with large language model (LLM) agents so that answers are grounded in the right context. It covers where a single, generic RAG system falls short and argues for small, use-case-specific RAG agents that talk to each other as a more scalable and contextually relevant design.

Full Transcript

So, a lot of an agent lies in how dynamically generates a prompt, which is passed to the large language model. For instance, if I can just say, "Hey, this this is my table schema. Generate a SQL query for this." It'll give you a SQL query, but the problem would be it's going to be highly hallucinated. It's been an interesting space like so far, we know. We started with, you know, a very basic, let say, if somebody asks me like like what ML is, for instance, you know, like what is this thing? So, back in if you remember in high school, we would have, you know, these simple equations like we're given two data points X1 Y1 X2 Y2 and you're asked that, you know, okay, what would be the value of Y3 at X3? And that's extrapolation. And essentially, that's what entire ML is. It's just that instead of flying, now we have so complex, you know, data points, so complex uh essentially curves and so complex contour maps. Like it's multi-dimensional. It's not even three-dimensional anymore. We just can't like, you know, visualize it in a human way, but uh there are so many dimensions to it, so much of data that we have. And essentially, it's all about, you know, like being able to predict based on what we know. Uh kind of, you know, finding out the models uh which are mathematical functions that could simulate what we have going on in the real world. So, essentially, what we are coming down to is creating more and more of those complex things. Be it like, you know, image processing for example in our, you know, live video feed for our cars in the Waymo that we see out there. They're able to predict, you know, right now uh when to stop based on, you know, like the distance calculations. And that's all like functions. And how we got here towards this GenAI, which is a great word right now, a huge buzz around it. Interestingly, we used to have uh we started with neural networks. We had like uh CNNs, convolutional neural networks. We had RNNs, uh recurrent neural networks. 
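The high-school exercise mentioned above (two known points, predict a third) can be written out as a minimal sketch. The function name and sample numbers are illustrative assumptions, not something from the conversation.

```python
def extrapolate(x1, y1, x2, y2, x3):
    """Fit the straight line through (x1, y1) and (x2, y2), then evaluate it at x3."""
    if x1 == x2:
        raise ValueError("x1 and x2 must differ to define a line")
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x3 - x1)


# Known points (1, 2) and (3, 6); predict y at x = 5.
print(extrapolate(1.0, 2.0, 3.0, 6.0, 5.0))  # 10.0
```

A trained model plays the same role as the fitted line here, only with far more dimensions and a far more flexible function.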
RNN was RNNs were the ones which were used extensively for NLP, uh the natural language processing. Aim was that how we are talking right now, we should be able to talk to machines, too. And we can understand like, you know, like what context of the any, let's say, website is or any uh what somebody's talking about. But the problem there was uh they did not have much attention mechanism. The attention span was so low that after like a few words, we just did not have the attention window that could go back and relate to what was being said. It's like a goldfish. Yeah. >> [laughter] >> Essentially. Uh and that did improve when LSTMs came into picture, which was long short-term memory neural nets. Uh and then GRUs came into existence. But still, the attention span was so low and that was primarily because of the architecture that we were following. What we were trying to do is we were trying to induce gates in these neural networks, which could ascertain how much of the context for input that is coming through we want to retain and how much we want to forget. Yeah. But those control gates, they were not efficient enough. They were There is no way we could have achieved what we have until and unless the 2017 revolution happened. >> Mhm. And there was this paper you would have heard of, "Attention is all you need." >> Yeah. The moment it came through, the transformer architecture when it was introduced, it just changed the entire game. And here we are now. One thing that I'm thinking about with this model evolution that you just broke down is what the next architecture is going to be. Cuz there's things that happen with transformers that are not things we want like hallucinations. Right. And so, sometimes some people will argue, "Well, that's like a feature, not a bug." 
And others will say, "Well, you know, like we really want it to be reliable and if you're going to have hallucinations, then it's not going to be reliable, but at the same time it's AI, it's machine learning, it is probabilistic. I I don't know like if hallucinations are the feature, but you're truly said, it's it's all probabilistic models, right? And it essentially depends on like how even the transformer architecture evolved. So, from LSTMs, like as we were just, you know, touching base on that, we essentially found out that, okay, what if instead of, you know, using uh uh these gates to control the attention span, what if we were to actually have some self-attention mechanisms from where these architectures started evolving? We had encoder and decoder. Then we had encoder-only models and decoder-only models. And it started from 2017 and eventually it was it just grew so much. Uh so, for instance, encoder-only models were aimed at just understanding the context of what this particular text is talking about. And we had these query, key, and value vectors. We're using softmax functions for assigning probabilities to these words. And then in the decoder, we wanted to make sure that when it's able to predict, we have some masking so that it's not able to induce a dialect from already pre-learned, you know, words. So that, for example, when me and you we are talking, uh before I register the input, I should not have any bias induced in me. And that was the aim. But the problem is that that bias is somehow induced because humans are also using these systems, where these hallucinations actually come from even more. And since it's a probabilistic model as such, these softmax functions, they produce probabilities right now. But moving ahead in the architectures which we are using in production right now, RAG was a big change. Yeah. Yeah, that was the one >> Yeah, so it wasn't on the model level, it was more on the system level in that way. 
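The query, key, and value vectors, the softmax, and the decoder masking mentioned above can be sketched in a few lines. This is a toy, single-head version in plain Python, read here as the standard causal mask where each position only attends to itself and earlier positions; the function names and the sample vectors are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions 0..i."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Masking: score only the current and earlier positions.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the probability-weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so every row shows the masked (future) positions.
        all_weights.append(weights + [0.0] * (len(queries) - i - 1))
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(tokens, tokens, tokens)
print(weights[0])  # [1.0, 0.0, 0.0]
```

The first token can only attend to itself, so all of its weight lands on position 0. The softmax output is a probability distribution, which is the "it's all probabilistic" point made in the conversation.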
And so, it went from like, all right, cool, we've got this really important model, but now how do we architect the system around it? And so, last year RAG was all the rage. I think probably the last 2 years RAG was very important. And that was like the next step in the evolution, we could say. We had the ChatGPT moment and then we started playing with it, we started using tools and chaining together prompts and then RAG became very popular and then you moved on to agent RAG or agentic RAG, right? But let's talk about RAG and what you were doing there and why it wasn't enough. Yeah. Uh I think this is a very interesting domain specifically because when the hallucination started to come in, now we have something that understands, you know, uh the context of language. Now, we have a model that wants to talk, but it just doesn't know what to talk about, you know, essentially. Uh and RAG is essentially uh I always envision it in a way that it's like a kid, you know, you're watching him give an exam or her give an exam. And it's an open textbook exam. So, that kid is referring to the books and getting you what you're asking it. But at the same time, we need to check for two things, if it's referring to the right books Yeah. >> when the questions are being asked, as well as like when it's answering, how much of the context that it is giving makes actually sense. Yeah. So, RAG essentially came through that picture where we would ask, let's say, GPT a random question and it would start giving us irrelevant results. Uh semantically correct, syntactically correct, but contextually not so relevant. >> Yeah. Yeah, and especially, I think I remember RAG became very popular just because of the fact that people wanted up-to-date information. 
And so then it was like, all right, well, we're just going to throw all the most recent information into a vector store and then we'll use that and anytime there's something that comes up, we can search the vector store and get that information. Right. Yeah. And yeah, and and that was a really one of the, I would say, uh genesis of RAG in that sense. And moving on to now like where we are with the agentic AI or that we say RAG agents, it's actually being used way across different contexts for which even it was thought of in the beginning. Mhm. Uh so, as uh uh we were discussing, when RAG started, the aim was to actually ground the LLMs, you know, let them make uh much relevant decisions based on the vector stores that we have. Uh and for example now, the ones that we are implementing, uh and it could be a very generic case study like if I want to talk to my databases, let's say, right? Uh and I need to generate a SQL query. Uh LLM can generate a SQL query, but would it be relevant at all? No. Again, hallucination happened. It doesn't even have a lot of context. And more than that, how can we even make sure because now if I'm talking to a production or staging data, it's risky because what if some user would come in and just add a drop table statement right there? My job is gone. >> tables, yeah. >> [laughter] >> Your job gets more complex, for sure. >> [laughter] >> But so, why did RAG fall over? Like where was why did you switch to agentic RAG and what are the differences between the two? Um So, if I were to make an analogous comparison, so when REST architecture came into existence, we started developing these REST services. I remember we started with SOAP services, then RESTful APIs were the standards. Then we delved into microservices. Basically, contextualizing each service very specific to the use case. 
So that it's easier to scale, it's much more contextually relevant, and it gives us much more relevant results with respect to the architecture and reusability of these. And here in agentic AI similarly, when we were grounding the RAGs for very generic use cases, we were getting great results. But now in the multi-agent systems, we are creating those agentic RAGs for a very specific use case, and now these agents are talking to each other. Mhm. Rather than having one wholesome agent for being >> Okay, I see. So the idea is trying to like break it down into microservices. >> Yes. And say you're an agent that has access to this vector database and another agent can almost like use you as a tool. Yeah. Yeah. >> And so the tool is a search and retrieval type tool but on our data in some place. Yeah. Yeah, and that also helps us in terms of specializing the context of a specific agent. It's a separation of concerns. Yeah. And we can have as many layers for security or for enhancing or enriching the data in between. >> Mhm. And it becomes individual bots who are just taking care of these things. And are you making each agent, like, only giving it access to one database? So it's like, that is the marketing database, and you can call that agent and it can retrieve everything and then enrich or summarize the answer and then give it back to the main agent. Uh yeah, somewhat in those terms specifically. So the aim here generally is that we create agents in a way which is very scalable, at the same time making sure we have data governance in place. Mhm. Because let's say also we have different dialects of data across different systems. Yeah. And one of the ways that we can implement it is using SQLGlot, which is a security layer which can transform, but at the same time sometimes that's not desirable because of the different systems in place, separation of concerns being there.
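The earlier worry about a user slipping a DROP TABLE statement into generated SQL suggests putting a guard between the LLM and the database. The sketch below is one possible guard, not the speaker's setup: it uses the standard-library `sqlite3` authorizer hook on an in-memory database, and the table, data, and helper names are assumptions for illustration.

```python
import sqlite3

# Actions a read-only reporting query needs; everything else is denied.
ALLOWED_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Called by SQLite while it compiles each statement."""
    return sqlite3.SQLITE_OK if action in ALLOWED_ACTIONS else sqlite3.SQLITE_DENY


def make_guarded_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("INSERT INTO sales (amount) VALUES (10.0), (32.5)")
    conn.commit()
    # Install the guard only after the demo data is in place.
    conn.set_authorizer(read_only_authorizer)
    return conn


def run_generated_sql(conn, sql):
    """Execute LLM-generated SQL; return rows, or None if the guard rejects it."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        return None


conn = make_guarded_connection()
print(run_generated_sql(conn, "SELECT amount FROM sales ORDER BY id"))  # [(10.0,), (32.5,)]
print(run_generated_sql(conn, "DROP TABLE sales"))                      # None
```

In a production system the same idea is usually enforced with a read-only database role, with a check like this as an extra layer rather than the only one.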
Uh so what we do is essentially design very specific bots for very specific use cases. And it also helps in the cost optimization. What if this was not to be used in production but it was to be used for um or when I say production I mean for the outer world but more like you know for internal efficiency of the workforce let's say or onboarding. Yeah. So do we really need those kind of resources to put in into those agents versus the ones which are going to be consumer-facing? Yeah. So these help in making those kind of decisions. >> Yeah, exactly. There's a lot of different trade-offs that you can be okay with I imagine if it's just internally facing. Yeah. And on so many different vectors probably quality and on speed on or or maybe you're like >> [snorts] >> no, we have to get it really fast because it's >> [laughter] >> um but reliability I would imagine is you just have less high of a bar. Yeah. >> If it's internal because the internal user is going to be much more forgiving than the external user. Oh, 100%. Yeah. And hopefully. Hopefully. [laughter] Yeah, exactly. And so then all right, so I'm kind of understanding it. I think the thing that I ask myself if I'm understanding this correctly, you have agents that are able to query databases. Why not just make an MCP server for the database? We possibly could. Mhm. That could be a way to go. But the main context comes through is like what use case are we like serving? Mhm. Essentially. If it's let's say we have to like talk let business talk to a database and they want certain reports in just certain ways, you know. And we want to expose that. Would we want to go through the route of MCP server and create whole another layer to it? Mhm. Um it's at all like optimizing how much resource and information we should put into uh for achieving a specific use case. Yeah. At the end of the day for consumer-facing probably that might make sense. 
For me what it sounds like is you have different use cases and they're very verticalized. So maybe there's a team or there's a suite of folks that need information and you create an AI product that can do that thing really well. So create dashboards from the sales data or the financial data, whatever it may be. And then you have another product and it's in a way separated. And so you have like separation of concerns, which is really good, but at the same time you have to create a whole new product around it. Is that it? I imagine some of the pieces are going to be reusable and you can say, all right, well, this is similar. We just need to change this and tweak some prompts. Like as if it was that easy. >> [laughter] >> And give it access to this database instead of that database. But is that how it is? It's like each individual product, and then you have to upkeep the products for the internal teams? Um yeah, it's almost in that direction specifically. So definitely we can't discuss details of the internal architecture, but at a very high level that's essentially what it boils down to. Let's say I'm building a bot for one of the teams in Slack which aims at onboarding, for instance. And similarly there is a bot which works with different databases, let's say in sales. These two are going to have some interchangeable components, but their vector DBs are going to be different, for instance. Because let's say if it were just creating SQL queries, we don't need something like Vertex AI Matching Engine or Milvus DB for that matter. We can use something very lightweight like FAISS. Just Facebook AI Similarity Search. Still running hard. It's >> [laughter] >> created all those years ago and it still is just amazing how well it works. Yeah. >> [laughter] >> But I I so I understand that.
It's uh you choose what you need to use also depending on the use case, because the use case almost dictates what kind of necessities you're going to have. Yes. Yeah. So some of it can be, oh well, we're going to need access to the same databases cuz there's some overlap. But I imagine you're not using the same agents for those. You're creating new agents, because then there could be some like context mixing and that could be bad. Oh yeah. Yeah, that's so true. So the soul of an agent lies in how dynamically it generates its prompt, which is passed to the large language model. Uh for instance, I can just say, hey, this is my table schema, generate a SQL query for this. It'll give you a SQL query, but the problem would be it's going to be highly hallucinated. Mhm. It would not know where to join. It would not know how to or which particular columns to join on, you know. And the aim is essentially to eradicate that middle layer where we have to constantly describe these things. Yeah. Right. So then it boils down to, okay, how do we make sure that our system or our agent creates this dynamic prompt which is passed to the LLM, because the LLM is going to do its hallucination on its own side for sure. We can't stop that. Yeah, it's like playing telephone. >> [laughter] >> But how are you making the SQL queries then if you're not letting the LLM generate it? Oh, the prompts are actually dynamically generated. SQL queries are definitely generated by the LLMs. Okay. So the prompts which are passed to the large language model, those basically are generated based on the documents which are retrieved by the retriever. >> Oh. And those documents are retrieved in this high-dimensional space through semantic similarity. Excuse me, searching. And that's where the vector DB's role comes into the picture so much. Like what kind of vector DBs do you want to use? Yeah.
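The flow described above (retrieve schema documents by similarity, then build the LLM prompt from whatever was retrieved) can be sketched as follows. Everything here is an illustrative assumption: the schema docs are invented, and a bag-of-words count vector stands in for a real embedding model and vector DB.

```python
import math
import re
from collections import Counter

# Hypothetical schema docs describing tables and their join keys.
SCHEMA_DOCS = [
    "Table sales: columns sale_id, product_id, customer_id, amount, sold_at. "
    "Joins to products on product_id.",
    "Table products: columns product_id, name, category. Joined from sales on product_id.",
    "Table employees: columns employee_id, team, review_score, reviewed_at.",
]


def embed(text):
    """Stand-in embedding: word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, docs, k=2):
    """Return the k docs most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(question, docs=SCHEMA_DOCS):
    """Assemble the prompt dynamically from whatever the retriever found."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (
        "You write SQL. Use only the tables, columns, and joins listed below.\n"
        f"Schema context:\n{context}\n"
        f"Question: {question}\nSQL:"
    )


print(build_prompt("total sales amount by products category"))
```

Because the join keys arrive in the prompt with the retrieved docs, the model no longer has to guess which columns to join on. The unrelated employees table never reaches the prompt.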
Do we really want to go with something, let's say, if I want to search across all the like 40 million products of Walmart and I want to find out the products for, let's say, all the users? It's humongous data. Yeah. Even creating embeddings for that, my gosh, it just boils on the system. >> So expensive I imagine. >> [laughter] >> I can't even fathom how much data that would be. Yeah. And so then you have to decide what subset of the data you want and then throw it into a vector DB, and you're spinning up new vector DBs for all these different use cases. Uh yeah, so we make a choice based on what industry standard is being used, how much the cost is, and whether we have an open source solution for it. And sometimes open source solutions are available, like Milvus DB, a great vector DB, you know. But the problem is it can have a little bit of high latency even at the same indexes like IVF, PA, or IVFPQ for Vertex AI. But the thing there is, if I were to generate, let's say, recommendations for my customer base, and if I were to generate them once a day, I can probably use, you know, an open-source one. Yeah. Why would I want to spend on something which is really costly for me? >> Yeah. And it's again very use case specific. But if it were an online serving model, which is, oh, I'm generating them every half an hour, the pipeline is constantly running, or every 15 minutes, in that case, yeah, I would have to, you know, shell out that cost. Yeah. And then it would make sense to use very low-latency and very highly indexable, uh, Vertex, sorry, vector DBs. Yeah. that we can actually use. If you got the chance to just start from scratch and build something, how would you go about it? Okay, let's take an example. Give me an example like what would we need to build. What would we need to build?
Ah, is there something Is there something that you feel is is uniquely valuable in the e-commerce space? There are a lot of things. Um I would say the two most prominent examples that come to me is >> So, sometimes when we launch a new product, for instance, um or a new customer experience, we generally go about doing uh multi-armed bandits or AB testing. >> Yeah. But sometimes, uh we already have so much of uh uh good control experience uh and that we don't want to, you know, sway away from because it can cost us potential users. >> Uh-huh. So, uh and if we try out even in multi-armed bandit way, like we use Thompson sampling to sway one or two percent of the users, are we really willing to take that risk? Uh-huh. In that case, you can >> Because it's like you have so much, you don't want to lose what you have. You're you're playing for defense in a way. Yeah. Yeah, it's explore-exploitation trade-off, basically, you know. >> Okay, yeah, that's fascinating to think about that it's not like Yeah, you can't be willy-nilly because you're already so optimized. Yeah. >> [laughter] >> So, then uh in in explore-exploit trade-off that we generally go around here, we think in a way that, okay, uh I'm going to go ahead and uh use my control group, which is how it is right now, but at the same time, and I'm going to exploit it. But I might like explore like 1%. Mhm. It's definitely a cost to the business, but it can yield a lot more. But it's very dynamic in MI MAP, you know, like we'd go about And this is called Thompson sampling, basically, right? So, uh we just sway the users like that. But another another way that we can possibly try is, let's say, we launch something, uh some recommendation model, and we want to try it or direct users to it. Uh but we want to do that like after they have done experiencing the current product, you know. After they paid money. After they paid money. [laughter] The upsell can be something that is experimental. Yeah. 
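The explore-exploit trade-off and Thompson sampling mentioned above can be illustrated with a small simulation. This is a textbook Beta-Bernoulli sketch, not the speaker's production setup; the conversion rates, round count, and function name are assumptions. Note that in this form the share of traffic sent to each variant is not fixed in advance: it shifts as evidence accumulates.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling over a list of variants.

    true_rates are hidden conversion rates, used only to simulate users.
    Returns how many users each variant received.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible conversion rate for each variant from its Beta posterior.
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        pulls[arm] += 1
        # Simulate whether this user converted, then update that variant's posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Variant 0 is the current experience (control), variant 1 is the new one.
pulls = thompson_sampling(true_rates=[0.05, 0.15], rounds=5000)
print(pulls)  # the second variant should end up with most of the traffic
```

If the new variant were worse than control, the same loop would starve it of traffic instead, which is the defensive behaviour the conversation is after.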
So, there like we once I remember used something a a kind of a bot, you know, of, you know, us uh trying out rag agents, essentially. Um that specific use case, for instance, we were actually going through a lot of products, uh a lot of, you know, embeddings from the user data. And we could not have gone with uh something like, you know, lightweight. So, at that time, we did explore very specific uh vector DBs like, you know, uh Vertex AI or you know. uh Milvus DB. Yeah. If a use case is like that, then yes, absolutely, I would go with those uh ones. But if it's a use case something like, you know, I want to generate um seek uh reports for the business, uh I wouldn't have much data of the schema, uh you know, like it's barely in kilobytes, you know. Yeah. And in that case, I would use something very lightweight open-source which is out there. Yeah. All I need to make sure is in that lightweight vector DB, it's able to pick up at the right time, the right context by doing, you know, the right similarity matching. Mhm. Be it cosine similarity, be it uh Pearson correlation, like however it finds. Yeah. Pearson correlation being one of the really interesting things, you know, when we think about uh user reviews and all. Mhm. Now, that's So, that's on the vector DB side. What about just architecting it? Architecting the whole system, you had a blank slate, and you come into a startup, and it's like, okay, sweet, we want to build this product. How do you go about that? Excuse me. Um that would be um very much, of course, reliant on the kind of problem that we are solving, but let's assume if we are solving a problem which comes through the rag agents or generative AI. Yeah. So, uh first of all, um definitely uh it would be what kind of data we have. Um is it a textual data? Is it an image data? Is it like what kind of problem you're trying to solve? Yeah. Um and let's say if it's a textual data, for instance, uh do we have like different context of the data? 
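Cosine similarity and Pearson correlation, both named above as matching options, differ in one step: Pearson subtracts each vector's mean first. The sketch below shows why that matters for something like user reviews; the rating vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0.
    return dot / norm if norm else 0.0


def pearson_correlation(a, b):
    """Cosine similarity after subtracting each vector's own mean."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


# Two hypothetical reviewers who rank items identically, but one rates everything higher.
harsh = [1, 2, 3, 2]
generous = [3, 4, 5, 4]
print(round(cosine_similarity(harsh, generous), 3))    # slightly below 1
print(round(pearson_correlation(harsh, generous), 3))  # 1.0
```

Pearson ignores the fact that one reviewer is simply more generous, which is one reason it shows up in review-based recommendation.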
Like uh do we have different data governance uh uh rules in place? Like do we have different geographical locations in place? Like Europe has like, you know, huge data uh privacy requirements compared to, you know, here. Here and This is a startup. We don't have >> [laughter] >> There's the I imagine that is uh a beautiful thing if you have the data governance rules, and you know everything on also just where the data goes and how you can't do anything with the data unless you follow the processes. Sounds like an amazing place to be in, but I imagine that you get there or you need a team of people to be focusing on that. That's not something that just magically happens. Right. Yeah. That's true. That's true. But essentially, when I'm trying to build up a prototype, I would want it to be scalable to point where uh whenever because those things are going to come through down the road. And I would want to be like prepared with my architecture Yeah. to be able to include that. Oh, interesting. Yeah. So, in that case, I would uh go about, you know, building uh small agents for very specific use cases. And then rerouting the incoming queries based upon what that specific context is. So, that uh the agents like which are catering to very specific data sets, uh the its scalability is maintained at the same time making sure uh that the rerouting that is happening, it produces very optimal that dynamic prompts. Uh-huh. Which can be then used to query any large language model, tune it, or fine-tune it the way that we want it any temperature that we want it to. And that routing happens with an LLM, or that's just a a router? It's like a It it's part of the uh rag agent itself. >> Mhm. So, there's an agent uh without, basically, the last part, which is the generator. So, it's in the uh augmentation part. >> Mhm. Where uh essentially uh we have query coming in, and then we are routing based on the what kind of context uh we want to route that query specifically in. 
Uh and then specifically querying those specific agents, in a sense, to generate further dynamic LLM prompts that we would use. Are there use cases that you particularly like and have seen like a lot of usefulness with? Yes. Absolutely. Uh so, of course, data analytics is one part, you know. Uh what used to happen before was like business would get back to engineers, and they would be like, okay, we want these kind of reports. Uh of course, they would have their dashboards and like Power BI's would be there, right? Uh but now, they can just straight away talk to these databases, essentially. And that's a huge huge uh win uh for any uh For the data team that doesn't have to service those requests [laughter] anymore, that's for sure. Yeah. Yeah, I remember talking to Donae about this uh cuz she built a data analyst agent, and one thing that she said was the hardest part in building out the agent so that it gave correct answers and it understood the context was they had to build out a whole glossary of terms. Since a lot of what you say when you are speaking to another person, you're using this like lingo, and even if it isn't marketing lingo, it is still fuzzy in the way that our company or our team describes that. So, an MQL in marketing terms is like a marketing qualified lead, and at this company, we describe it someone who has, you know, downloaded the ebook. But at another company, it's not until you download the ebook and you come to a live event, or you reach out to sales because you went to a webinar, and so there's even with the same term, the same word, it's very loaded, and that happens across the board. And so, like, how do you do Did you do like a glossary thing like that? Yeah, I and that that is a very interesting problem to be honest, you know. Even like within the company, you know, where different teams have different lingos for different kind of things sometimes. 
But yeah, and one of the ways I overcame that was essentially defining the schema docs which were being passed to, let's say, the RAG agent that we have been building. So you can define the relationships there, or the mapping essentially, which is the glossary, you know. Because as such, the large language model doesn't understand anything. It's just understanding the language, right? We are telling it, okay, what this is, how do we need to do stuff, and it's just creating very semantically correct, syntactically correct answers for us. But yeah, so a mapping layer for very specific lingos, which can be mapped to very context-specific terms, that becomes absolutely necessary for that. And actually the other piece I think that could get tricky, and I would love to hear how you deal with it, is that you get natural language questions like, how did we do this quarter? Which is like, what do you mean by that? [laughter] Right. And so maybe the agent can come back and say like, are you talking about revenue? And it's like, yeah. Are you talking about revenue generated in the whole company or just your team? There's so many variables that when you talk it's not clear. True. And so how are you dealing with that, that the agent isn't just giving you stuff? Because the hardest thing in the world is getting the agent to say, I don't understand. Right? Like it'll just come back with like, oh, here's how we did this quarter, and you're kind of scratching your head like, I don't know if that's actually what I was looking for. >> [laughter] >> So true. And thanks for mentioning that because that really invoked two thoughts in me. One is, I don't know if you've heard of Dr. De Kai. Yeah, he actually has just recently published a book. It's called Raising AI. Very interesting. It's about how, you know, we need to work along with the AI. And he's a professor at Stanford here.
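The glossary, or mapping layer, discussed above can be sketched as a lookup that attaches team-specific definitions to a question before it reaches the model. The glossary entries and helper names below are hypothetical; in the conversation the mapping lives inside the schema docs rather than in a separate dictionary.

```python
import re

# Hypothetical per-company glossary; the definitions are illustrative.
GLOSSARY = {
    "mql": "marketing qualified lead: a contact who has downloaded the ebook",
    "active customer": "a customer with at least one order in the last 90 days",
}


def expand_terms(question, glossary=GLOSSARY):
    """Return the glossary definitions whose term (or its plural) appears in the question."""
    found = []
    for term, meaning in glossary.items():
        if re.search(rf"\b{re.escape(term)}s?\b", question, flags=re.IGNORECASE):
            found.append(f"{term} = {meaning}")
    return found


def with_glossary(question):
    """Prefix the question with definitions so the model sees the team's meaning."""
    notes = expand_terms(question)
    if not notes:
        return question
    return "Definitions:\n" + "\n".join(notes) + "\nQuestion: " + question


print(with_glossary("How many MQLs did we get last month?"))
```

Swapping the dictionary per company or per team is what lets the same term carry a different definition in each deployment.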
Really interesting, but we'll definitely get into it. But the from the technical side of it from abstraction to coming to a very specific use case, that has been the biggest challenge for us at the end of the day. And the reason why a large language model or an agent would never say that I don't know because it works on our confirmation bias at the end of the day. It just wants to answer no matter what. So for specifically for like um one of the ways that I was able to achieve it was contextualizing the prompts that we are developing dynamically, not the ones, you know, like which we write statically. But the ones which are being developed dynamically in these agents. If And that's why that becomes the soul of the agent. Because if it's able to provide from that abstraction, okay, how did we do this quarter? To okay, what do we need? Like revenue, we need sales, we need product sold, and all this information can be put out. So it will give us very context-specific results. And how do we do that? Depends on couple of things. One is how is our data schema and structured. And when we say this quarter, it's just going to pick up that date range. How did we do? We can always map these queries to revenue, product sold, let's say sales, employees' performance. It could be anything. But and that's where very context-specific agents come into picture. If I am let's say business and I'm just focused on the sales of it versus if I have I'm an HR team, I'm working on the employees' performance, I would want to have two different answers to that. And hence even in the reusability of those agents, we would have to like configure them or tweak them in a way. Although we can deploy the same ones and they can constantly ask them, they'll get the result, but we would want them to be very specific and precise to what they are like >> So this is like a little bit of a personalized agent. 
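Turning the vague question "how did we do this quarter?" into something explicit involves two steps named above: picking up the date range and mapping the question to the metrics a given team cares about. A minimal sketch, assuming calendar quarters and an invented team-to-metrics table:

```python
from datetime import date

# Hypothetical mapping from team to the metrics a vague question should resolve to.
TEAM_METRICS = {
    "sales": ["revenue", "products sold"],
    "hr": ["employee performance"],
}


def quarter_range(today):
    """Return the first and last day of the calendar quarter containing `today`."""
    first_month = 3 * ((today.month - 1) // 3) + 1
    start = date(today.year, first_month, 1)
    if first_month == 10:
        next_start = date(today.year + 1, 1, 1)
    else:
        next_start = date(today.year, first_month + 3, 1)
    # The quarter ends the day before the next one starts.
    return start, date.fromordinal(next_start.toordinal() - 1)


def contextualize(question, team, today):
    """Rewrite a vague question into an explicit request for one team."""
    start, end = quarter_range(today)
    metrics = ", ".join(TEAM_METRICS[team])
    return f"{question} Report {metrics} between {start.isoformat()} and {end.isoformat()}."


print(contextualize("How did we do this quarter?", "hr", date(2024, 5, 17)))
# How did we do this quarter? Report employee performance between 2024-04-01 and 2024-06-30.
```

A company on a fiscal calendar would need a different `quarter_range`; that is the kind of detail the schema docs or glossary would have to carry.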
So it knows that I'm on the HR team, and when I ask about how we did this quarter, it's like employee engagement. Yeah. >> [laughter] >> Essentially, yeah. If I'm on the sales team, it's like how much did we sell? But maybe if I'm on if I'm in the C-suite, I'm looking more at like holistic view of how all the numbers are. Yeah. And if you imagine, let's say we are agents, right? What context do I have? I'm just going to go and look up okay, what documents or what index I have in my DB. This is the query that came in. There is an embedding of this query. Like let's say 0.12, 0.34. I'm going to go and see in the vector space where are like what are the closest points to this? I pick that up. I bring it back. Now those specific points in the semantic search, they could mean very different things for different agents. But those points which are suspended in there, that depends on of course the embedding models that we are using. And because we cannot perform, you know, semantic search when coming from different embedding spaces. Specifically because what happens is it's a way to represent our textural pictorial or any non-linear data in a numerical form essentially. And when we are picking it up from the schema docs, let's say, or from any documents that we are feeding into the rag agent the schema docs define essentially what context or which, let's say, tables to pick the data from. If I'm picking up from let's say a sales table, it's going to give me more context around sales. If I'm picking up from let's say employees' performance table it's going to give me more about that. So inherently as an agent Wait, sorry. I missed I missed that. It was from vector space that it gives you that information? Yes, because those vectors are essentially mappings into numerical terms of what these, documents or schema So you're enriching the schema with the vector with basically vector space. Yeah. Yeah. Essentially. >> Okay. 
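The lookup described above (embed the query, find the closest stored points) and the warning about mixing embedding spaces can be combined in one small sketch. The model labels, vectors, and document ids are made up for illustration, and a linear scan stands in for a real vector index.

```python
import math
from dataclasses import dataclass


@dataclass
class Embedding:
    model: str    # label of the embedding model that produced the vector (hypothetical)
    vector: list


def distance(a, b):
    """Euclidean distance; refuses to compare vectors from different embedding spaces."""
    if a.model != b.model or len(a.vector) != len(b.vector):
        raise ValueError("embeddings come from different spaces and cannot be compared")
    return math.dist(a.vector, b.vector)


def nearest(query, index, k=1):
    """Return the ids of the k stored documents closest to the query."""
    ranked = sorted(index, key=lambda doc_id: distance(query, index[doc_id]))
    return ranked[:k]


index = {
    "sales_schema": Embedding("model-a", [0.10, 0.30]),
    "employee_schema": Embedding("model-a", [0.90, 0.80]),
}
query = Embedding("model-a", [0.12, 0.34])
print(nearest(query, index))  # ['sales_schema']
```

Tagging each vector with the model that produced it is a cheap way to catch the case where an index was built with one embedding model and queried with another.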
So it's like going into this multiverse and finding out: I just need to pick something up, but I don't know what it's going to map to. The mapping was done based on what was fed in before, and what was fed in depends very much on what kind of relationships we defined beforehand and how the index was created. That index creation essentially happens as the first phase of this agentic RAG.
>> Wow. Okay, that's super cool. I haven't heard of adding a little bit of extra context so that it understands. It's almost like you're grounding the information. It goes back to how RAG was all about grounding the models, and now we're grounding the agents with a little bit of extra vector space, semantic vectors, and all that stuff. Tell me more about these dynamically created prompts. How does that work?
>> Okay, let's go back to our example: how did we do this year, or this quarter? Let's say we have three tables: a sales table, an orders table, and a products table. In the schema docs that are passed to the RAG agent, from which the index is created, we define that these are the sales, we tie them to the products with a key, and there's a key for sales, products, and the third one, let's say customers. So when we ask how we did this year, the answer would be based on what customers bought this quarter, what products were sold, and how much revenue was generated. Because we define the schema of the tables and the relationships between them inside the docs, the embeddings that are generated also encapsulate the semantics of the relationships between these three tables.
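
One way to picture the schema docs described above is as short text documents, one per table, that spell out columns and key relationships before anything is embedded. The sketch below is an assumption about what such docs might look like; the `Table` class, the `schema_doc` helper, and every column name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: dict                                      # column name -> type
    foreign_keys: dict = field(default_factory=dict)   # column -> "table.column"


def schema_doc(table):
    """Render one table as a plain-text document suitable for embedding."""
    cols = ", ".join(f"{col} ({typ})" for col, typ in table.columns.items())
    sentences = [f"Table {table.name} has columns: {cols}."]
    for col, target in table.foreign_keys.items():
        # Writing the relationship out as text lets the embedding carry it.
        sentences.append(f"{table.name}.{col} references {target}.")
    return " ".join(sentences)


# Hypothetical three-table schema (sales, products, customers).
tables = [
    Table("products", {"product_id": "int", "name": "text", "price": "decimal"}),
    Table("customers", {"customer_id": "int", "name": "text"}),
    Table(
        "sales",
        {"sale_id": "int", "product_id": "int", "customer_id": "int",
         "sold_at": "date", "revenue": "decimal"},
        {"product_id": "products.product_id",
         "customer_id": "customers.customer_id"},
    ),
]

schema_docs = {t.name: schema_doc(t) for t in tables}
for doc in schema_docs.values():
    print(doc)
```

Each resulting string would then be embedded and stored in the index, so a query about sales can land near the document that also mentions the product and customer keys.
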
And when I ask this, this is the only information I have; I have not told the RAG agent to do anything else. When I ask about this quarter, it picks up the time range and finds out what data it can fetch from these tables in that range. Then we fine-tune how these dynamic prompts are generated, in the retriever methods and in our indexer as well. From there we can be very specific about what we want in the reports.
>> Interesting. I'm trying to think about the step-by-step nature of this. You're getting the query, "How did we do this quarter?" There is a model that receives that. It also goes and searches the vector space. And then it outputs a prompt for another agent to go and use.
>> Yes. Yeah.
>> And the output is where you're very specific, because you have the information: this is the relationship between these three tables, here's what I want you, specific agent, to look for, here's what you need to focus on. It goes and finds that, comes back, and tells the master agent, "Here's what I got."
>> Yeah.
>> And then the master agent has all of this in its input, summarizes it, and outputs something.
>> Yeah, and it could even have used that to retrieve more, do more.
>> Uh-huh.
>> So we have this output now, and of course we are going to have evals in between. Now let's say we want to relate it to some other set of databases; another agent picks it up and creates a dynamic prompt based on that. So it's a multi-agent system, with the agents communicating with each other while making sure they stay contextually relevant to their own specific set of questions. But it all happens because we are able to create very relevant dense embeddings in these vector spaces.
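
The flow just described (resolve "this quarter" to a date range, attach the retrieved schema context, hand the result to the model) can be sketched roughly as follows. This is an illustration under assumptions: `quarter_bounds`, `build_prompt`, the prompt wording, and the sample context strings are all hypothetical, and calendar quarters are assumed rather than a fiscal calendar.

```python
from datetime import date, timedelta


def quarter_bounds(today):
    """First and last day of the calendar quarter containing `today`."""
    first_month = 3 * ((today.month - 1) // 3) + 1
    start = date(today.year, first_month, 1)
    if first_month == 10:
        next_start = date(today.year + 1, 1, 1)
    else:
        next_start = date(today.year, first_month + 3, 1)
    return start, next_start - timedelta(days=1)


def build_prompt(question, context_docs, today):
    """Assemble a prompt from the question, retrieved docs, and date range."""
    start, end = quarter_bounds(today)
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the schema context below. "
        "If the context is insufficient, say so.\n"
        f"Schema context:\n{context}\n"
        f"Date range: {start.isoformat()} to {end.isoformat()}\n"
        f"Question: {question}"
    )


# Hypothetical retrieved context; in practice this comes from the index.
retrieved = [
    "Table sales has columns: sale_id (int), sold_at (date), revenue (decimal).",
    "sales.product_id references products.product_id.",
]
prompt = build_prompt("How did we do this quarter?", retrieved, date(2025, 5, 14))
print(prompt)
```

The design point is that the user's words stay short while the prompt the model sees is rebuilt on every query from whatever the retriever returned.
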
And when a user query comes in, it is also put in the vector space, or projected into the vector space, that being the right word. From there, what gets picked is the relevant context that can match this query. So we don't have to manually define a lot of things for RAG agents unless it's absolutely necessary because something isn't being picked up, which can be a back-and-forth process sometimes. That's why prompts alone are not very successful at getting large language models grounded. RAG agents introduce the additional step of suspending our relevant information in this generated index. A query comes in, it's projected, we find the match, we come back, and we generate the dynamic prompt with that context. And that gives us the exact results that we want.
>> Yeah, it gives you a much richer field to play from. It's so much more enriched with all of that information, as opposed to just my simple words, "How did we do this quarter?"
>> Yeah, essentially.
>> I guess I'm lost on how, when you put the query into vector space and see what it is semantically similar to, it already has all of this stuff that it's semantically similar to.
>> Makes sense. So
>> Yeah, all right. I was hoping it did, but I didn't know, because I was [laughter] confusing myself for a second there, to be honest.
>> No, no, that's a very valid point. Before we even started asking our agent any questions, we built an index. That index was built on some documents, and "documents" is a very abstract term for any set of data that we would use. Now imagine a three-dimensional space where we just have X1, Y1, Z1, and we have put in just the data of our two tables, or for that matter this conversation. Let's say we want to go back and review this conversation. Somebody wants to ask questions about it.
So we have created a kind of analog for this. What we have done is create embeddings where, let's say, each question and each answer represents one document. When we create embeddings for these documents, we are suspending them in the vector space. Now let's say somebody comes up and asks, "When Dimitri asked this, what was the answer?" or "Did these two questions make sense?" This query goes into the same vector space, and its embedding is generated with the same model that was used to create the index from our conversation. Then it performs the semantic similarity search. We can tweak it to use any similarity measure: Euclidean, cosine, Pearson, anything. It finds the closest matches, and we can also define how many neighbors we want it to find. That's a hyperparameter we can tune. If we include too much context, it can waver off, and with too little context it can also waver off. For me, in the RAG agents I built, five nearest neighbors worked really well for the approximate nearest neighbor search. It was a small use case. So when a new user query comes in, it's suspended in the same space, the semantic similarity search runs, and it gives us back the dynamic prompt that is passed to the large language model.
>> And you're always updating it with the new queries? You're always adding the new queries to the vector space?
>> Yeah, so what happens is, let's say I did not get the results that I wanted. I'm going to go back to my docs. I'm going to check: did I define the schema correctly? What went wrong?
Why wasn't it able to find a relevant context? And I'm definitely going to update those docs. I'm going to define another doc that defines the schema or the relationships between these questions. So I'm defining these things up front to make sure that whatever answer our large language model gives is grounded and comes from exactly what we want it to do.
>> How are you dealing with the problem of just having too much data, and messy data?
>> Fortunately, in the use cases we have worked with so far, the data wasn't too messy.
>> Uh-huh.
>> Take the example from our conversation: if it's a RAG agent that talks to databases, it's going to be very straightforward. There's the schema of the database, the types of the columns, and the relationships between the tables. Nothing that can't be inferred. But in the normal course of things, we definitely have to make sure we have proper data pre-processing in place, where we clean up the data and remove any unnecessary tags, because we wouldn't want our LLM to focus on unnecessary information, like any random
>> It overweights one word and you're like, why? Why did you care about that word? It's not that big of a deal.
>> Yeah. And this also comes down to our earlier discussion about how the transformer architecture works. It's essentially assigning probabilities to all these words and doing MLM, which is masked language modeling, or predicting the next word. If you want certain words to be weighted more, you have to make sure that, for a very specific use case, you eliminate the parts that are not desired or required.
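
As a small illustration of the pre-processing step mentioned above, the sketch below strips markup tags and collapses whitespace before a document is embedded. It is a deliberately naive assumption about what "removing unnecessary tags" could mean; the `clean_document` helper is hypothetical, and a regex like this is not a full HTML parser.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive match for anything that looks like a tag
WS_RE = re.compile(r"\s+")


def clean_document(raw):
    """Drop tags, decode HTML entities, and normalize whitespace."""
    text = TAG_RE.sub(" ", raw)    # remove tags first so escaped text survives
    text = html.unescape(text)     # e.g. "&amp;" becomes "&"
    return WS_RE.sub(" ", text).strip()


raw = "<p>Sales &amp; orders:  <b>revenue</b> by quarter</p>"
cleaned = clean_document(raw)
print(cleaned)
```

Cleaning before embedding keeps layout noise out of the vectors, so the similarity search is driven by the words that actually matter for the use case.
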
>> Yeah. As such.
>> Well, let's say that some data becomes stale because, for some reason or another, you don't have the same policy anymore, or you realize that this data was actually incorrect, and so we need to change it out. How are you going about swapping things? I've heard that is a real pain in the butt when it comes to keeping your vector database up to date.
>> Yeah, that is a big challenge for sure. One of the ways, and it's definitely going to be a costly process, would be to re-index our database.
>> Oh, interesting.
>> To create the index again. And as the data increases, indexing it becomes harder and harder.
>> Yeah, if you only want to change one file, [laughter] it's like, we've got to re-index what?
>> Yeah. I wonder, though, how we would go about it in that case. One way could be introducing a negative example for something we don't want to include in our RAG agents. For example, if a certain part of the data became stale, we can always include a relationship in the documents that says: if asked for this specific data, this is stale data, and we would not want it to go there. So even though it might be picked up in the semantic search, the good part is that when it goes to the large language model, it is going to reject it immediately, because in our prompt we're specifying not to answer in that direction. But that's more of a sanity check or a smoke test.
>> And it's okay to do if there's one or two pieces of data, but if it's 20
>> No. No, you can't do that so easily. Yeah, it would not be as scalable, but that's a really good problem.
>> I'm just thinking because I heard the example of your HR handbook, where certain policies get updated. What do you do when someone's asking their HR questions?
Which handbook is it getting information from? Which policies are getting referenced? This was back in the RAG days, and I just remember people talking about how hard it was to keep their vector databases tidy and up to date. So I find it interesting. For you, it's almost like you're not having to replace as much data, so you don't have that problem. You're only creating new data, and then you can just use a time or date-created filter type of thing.
>> Yeah, for most of the use cases we have worked with, we don't have a lot of data that needs to be removed or replaced. But if that were the case, I believe re-indexing would definitely solve the problem.
>> Mhm.
>> And given the scale of the data, I'm sure we can also re-index just certain parts of the vector DB, for example by using only the one document that is the part we need to update. Those would again be suspended, or projected, in the vector space, but with different values. And negative examples could help us not entertain our previous policies. But that is definitely a very interesting problem, and I think we'd have to look into it in more detail.
>> You don't do anything with LLMs and recommenders, do you? Are your recommender systems all very traditional still?
>> No, we do use them.
>> Really?
>> Yeah, yeah, yeah.
>> How are you using LLMs for recommenders?
>> For instance, let's say we have some recipe recommendations, and I want to generate recommendations based on that page. I would want to extract, say, the cooking tools relating to certain recipes. Just giving an example. That is way easier to contextualize using a large language model.
>> But you're just contextualizing it, and then you're going to the traditional recommender system style, or
>> It's a hybrid.
>> Yeah. Hybrid, basically.
>> I've heard about this a few times from people. It's like you slap an LLM on top of a recommender system just to make that part so much easier, because you don't have to train this model on the whole cooking utensils and everything. [laughter]
>> Yeah, I mean, fortunately for us, our users don't scroll a Walmart page as often as Instagram. So that allows us a little bit of latency, specifically for the projects that I have worked with.
>> Yeah.
>> They are not as much of an online serving model. In that case, yes, latency is of the utmost importance. But for the use cases I have worked with, and without going into too much detail because of the company's confidentiality, we have used large language models to try out what features we can extract from the current recommendations,
>> Mhm.
>> so that we can use those features to recommend the next set of things immediately. At the same time, we combine user feedback to understand what users are liking and what else they will like. When we take natural text and try to get the feedback out of an item, or the description of an item, such as a set of instructions the user is following, there are so many products that could be relevant. And if one product has captured a user's attention right there, we have user engagement. The top of the funnel is already established.
>> Mhm.
>> Now we have user awareness, and we can use the top of the funnel, which is a bit of a marketing term, to work out what else we can bring into awareness. As user awareness increases for related products, engagement increases, and increased engagement leads to potential sales.
>> Yeah.
And they definitely come back. For example, we were working on how our YouTube banner impressions have led to sales of the items, and there is no way for us to relate them.
>> Oh, wow.
>> Yeah, and, yeah, sorry.
>> No, no, no. Okay, so really you're generating the features, or you're extracting the features, with the LLM?
>> Yeah.
>> But the features are already in some kind of feature store?
>> This is very dynamic, because what we're focusing on is what the user is looking at. Say we find that there's a set of recipes the user is looking at. What products can be mapped to this recipe? That can be derived from the description of the recipe, and that is [clears throat] very user specific. Now, we know for a fact it's going to be a little slow for online serving. I mean, the latency is going to be high. But we do have that user's data.
>> Uh-huh.
>> So we can further diversify the recommendations when the user logs in next time, or even at that time, for that matter.
>> Oh, I see. So you're using the LLM to enrich the recommendations.
>> Yes.
>> I understand. It took me a while, but I got here. [laughter]
>> No, it was a random experiment somehow, and it turned out to be pretty fascinating. Okay, wow, this works.
>> Yeah, [laughter] yeah. Oh, I really like that. So from all of the pages that I'm looking at on Walmart, especially if it's any type of blog post, like a recipe, you can extract everything from that page using an LLM and then enrich my customer profile with all that data.
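
The hybrid idea discussed above, where an LLM pulls features out of a page and those features enrich a user profile for a conventional recommender, might look roughly like the sketch below. Everything here is hypothetical: the prompt wording, `extract_features`, `enrich_profile`, and especially `fake_llm`, which is a hard-coded stand-in so the example runs without any model or API.

```python
def extract_features(page_text, llm):
    """Ask an LLM (any callable taking a prompt string) for cooking tools."""
    prompt = (
        "List the cooking tools needed for this recipe, comma-separated:\n"
        + page_text
    )
    reply = llm(prompt)
    return [item.strip().lower() for item in reply.split(",") if item.strip()]


def enrich_profile(profile, features):
    """Return a copy of the profile with new features added to its interests."""
    merged = dict(profile)
    interests = list(merged.get("interests", []))
    for feature in features:
        if feature not in interests:   # skip duplicates, keep first-seen order
            interests.append(feature)
    merged["interests"] = interests
    return merged


def fake_llm(prompt):
    # Stand-in for a real model call; returns a canned answer for the demo.
    return "Whisk, Mixing bowl, whisk"


recipe_page = "Pancakes: whisk the batter in a mixing bowl, then fry."
features = extract_features(recipe_page, fake_llm)
profile = enrich_profile({"user_id": "u1", "interests": ["skillet"]}, features)
print(profile)
```

Because the extraction can run offline after the page view, the latency of the model call does not have to sit on the online serving path; the enriched interests are simply available the next time recommendations are generated.
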

Original Description

Satish Bhambri is a Sr Data Scientist at Walmart Labs, working on large-scale recommendation systems and conversational AI, including RAG-powered GroceryBot agents, vector-search personalization, and transformer-based ad relevance models.

From Quantum Clouds to RAG Agents: Building AI Systems That Scale and Inspire, Satish Bhambri // MLOps Podcast #351

Join the Community: https://go.mlops.community/YTJoinIn
Get the newsletter: https://go.mlops.community/YTNewsletter
MLOps Merch: https://shop.mlops.community/

// Abstract

"The MLOps Community Podcast features Satish Bhambri, Senior Data Scientist with the Personalization and Ranking team at Walmart Labs and one of the emerging leaders in applied AI, in its newest episode. Satish has quietly built one of the most diverse and impactful AI portfolios in his field, spanning quantum computing, deep learning, astrophysics, computer vision, NLP, fraud detection, and enterprise-scale recommendation systems. Bhambri's nearly a decade of research across deep learning, astrophysics, quantum computing, NLP, and computer vision culminated in over 10 peer-reviewed publications released in 2025 through IEEE and Springer, and his early papers are indexed by NASA ADS and Harvard SAO, marking the start of his long-term research arc. He also holds a patent for an AI-powered smart grid optimization framework that integrates deep learning, real-time IoT sensing, and adaptive control algorithms to improve grid stability and efficiency, a demonstration of his original, high-impact contributions to intelligent infrastructure. Bhambri leads personalization and ranking initiatives at Walmart Labs, where his AI systems serve more than (5% of the world’s population) 531 million users every month, roughly based on traffic data. His work with Transformers, Vision-Language Models, RAG and agentic-RAG systems, and GPU-accelerated pipelines has driven significant improvements in scale and performance, including increases in ad engagement, faster

The video discusses the effectiveness of AgenticRAG, a system that utilizes retrieval augmented generation (RAG) and large language models (LLMs) to provide contextually relevant results. It explores the limitations of traditional RAG systems and introduces AgenticRAG as a more scalable and contextually relevant solution. The system uses dynamic prompt generation, vector databases, and semantic similarity search to provide improved results.

Key Takeaways

Build a vector database for efficient data retrieval
Create dynamic prompts using prompt engineering
Utilize large language models for contextualization
Use semantic similarity search for improved results
Fine-tune LLMs for specific tasks
Implement retrieval augmented generation for contextually relevant results

💡 AgenticRAG provides a more scalable and contextually relevant solution compared to traditional RAG systems by utilizing dynamic prompt generation, vector databases, and semantic similarity search.
