RAG is a hack - with Jerry Liu of LlamaIndex

Latent Space · Beginner · ⚡ Algorithms & Data Structures · 2y ago

Skills: LLM Foundations 90% · RAG Basics 90% · LLM Engineering 80% · Prompt Craft 70% · Vector Stores 60%

In October 2022, Robust Intelligence hosted an internal hackathon to play around with LLMs, which led to the creation of two of the most important AI Engineering tools: LangChain 🦜⛓️ and LlamaIndex 🦙 by Jerry Liu, which we’ll cover today. In less than a year, LlamaIndex has crossed 600,000 monthly downloads, raised $8.5M from Greylock, has a fast growing open source community that contributes to LlamaHub, and it doesn’t seem to be slowing down.

Full show notes and transcript here: https://www.latent.space/p/llamaindex#details

00:00:00 Introductions and Jerry’s background
00:04:38 Starting LlamaIndex as a side project
00:05:27 Evolution from tree-index to current LlamaIndex and LlamaHub architecture
00:11:35 Deciding to leave Robust to start the LlamaIndex company and raising funding
00:21:37 Context window size and information capacity for LLMs
00:23:09 Minimum viable context and maximum context for RAG
00:24:27 Fine-tuning vs RAG - current limitations and future potential
00:25:29 RAG as a hack but good hack for now
00:28:09 RAG benefits - transparency and access control
00:29:40 Potential for fine-tuning to take over some RAG capabilities
00:32:05 Baking everything into an end-to-end trained LLM
00:35:39 Similarities between iterating on ML models and LLM apps
00:37:06 Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning
00:43:10 Evaluating and optimizing each component of LlamaIndex system
00:49:13 Building retrieval benchmarks to evaluate RAG
00:50:38 SEC Insights - open source full stack LLM app using LlamaIndex
00:53:07 Enterprise platform to complement LlamaIndex open source
00:54:33 Community contributions for LlamaHub data loaders
00:57:21 LLM engine usage - majority OpenAI but options expanding
01:00:43 Vector store landscape
01:04:33 Exploring relationships and graphs within data
01:08:29 Additional complexity of evaluating agent loops
01:09:20 Lightning Round

What You'll Learn

The video discusses RAG (Retrieval Augmented Generation) and its application in LLMs (Large Language Models), with a focus on LlamaIndex, a toolkit for building and optimizing RAG systems.
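
As a rough illustration of the pattern the episode describes (keep the model fixed, retrieve the most relevant text with top-k similarity, and stuff it into the prompt), here is a minimal sketch in plain Python. It is not LlamaIndex's API: the function names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical, and the bag-of-words "embedding" is an assumed stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (assumed stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (top-k similarity)."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """'Stuff' the retrieved chunks into a prompt for a fixed, unmodified model."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "LlamaHub collects community data loaders.",
        "Context windows limit how much text a model can read.",
        "Greylock led the funding round.",
    ]
    question = "How big can the context window be?"
    print(build_prompt(question, retrieve(question, docs, k=1)))
```

Every choice here (the similarity function, the value of `k`, how chunks are cut) is set by hand rather than learned, which is the sense in which the episode calls RAG "a hack".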

Full Transcript

[Music]

Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI. And today we finally have Jerry Liu on the podcast. Hey Jerry.

Hey guys, thanks for having me.

It's so weird because we keep running into each other at San Francisco AI events, so it's kind of weird to finally just have a conversation recorded for everybody else.

Yeah, I know. I'm really looking forward to this and to the questions.

So I tend to introduce people on their formal background and then ask something on the more personal side. So you are part of the Princeton gang.

Yeah, I don't know if there is an official Princeton gang.

I attended your meeting. There were like four of you.

Oh cool, okay, nice.

With Prem and the others.

Oh yeah, yeah.

Where you did a bachelor's in CS and a certificate in finance. That's also fun. I also did finance, and I think I saw that you also interned at Two Sigma, where I worked in New York.

You were a machine learning engineer?

Yeah, very briefly.

Oh cool, all right, I didn't know that.

That was my first proper engineering job before I went into DevRel.

Oh okay, wow, nice.

And then you were a machine learning engineer at Quora, an AI research scientist at Uber for three years, and then two years a machine learning engineer at Robust Intelligence before starting LlamaIndex. So that's your LinkedIn. What's not on your LinkedIn that people should know about you?

I think back during my Quora days I had this three-month phase where I just wrote a ton of Quora answers. And so I think if you look at my tweets nowadays, you can basically see that as the V2 of my three-month Quora thing, where I just went ham on Quora for a bit. I think back then, when I was working on Quora, the thing that everybody was fascinated by was just general deep learning advancements and stuff
like GANs and generative images and just new architectures that were evolving. And it was a pretty exciting time to be a researcher, actually, because you were going in and really understanding some of the new techniques. So I kind of used that as a learning opportunity: basically just read a bunch of papers and then answer questions on Quora. And so you can kind of see traces of that in my current Twitter, where it's really about framing concepts and trying to make them understandable and educate other users on them.

Yeah, I've said so. A lot of people come to me for my Twitter advice, but I think you are doing one of the best jobs on Twitter of just explaining concepts and consistently getting hits out.

Thank you.

And I didn't know it was due to the Quora training. A side note on Quora: a lot of people, including myself, kind of wrote off Quora as one of the web 1.0 sort of question-and-answer forums, but now I think it's seeing a resurgence, obviously due to Poe. And obviously Adam D'Angelo has always been a leading tech figure. But what do you think is kind of underrated about Quora?

I really liked the mission of Quora when I joined. In fact, I interned there in 2015 and I joined full-time in 2017. One is that they had, and they have, a very talented engineering team, just really, really smart people. And the other part is that the whole mission of the company is to spread knowledge and to educate people. And to me that really resonated. I really like the idea of education and democratizing the flow of information. And if you imagine back then, it was like, okay, you have Google, which is for search, but then you have Quora, which is user-generated, grassroots-type content. And I really liked that concept, because there are certain types of information that aren't accessible to
people, but you can make them accessible by just surfacing them. And so actually I don't know if most people know that about Quora, if they've used the product, whether through SEO or actively, but that really was what drew me to it.

Yeah, I think most people's challenge with it is that sometimes you don't know if it's a veiled product pitch, right?

Yeah, of course, quality of the answer matters quite a bit.

And then: five alternatives, and here's the one I work on.

Yeah, recommendation issues and all that stuff. I used to work on recsys at Quora, actually. I kind of approached it more from machine learning techniques, which might be a nice segue into RAG, actually. A lot of it was just information retrieval. We weren't solving anything that was super different from what was standard in the industry at the time, just ranking based on user preferences. I think a lot of Quora was very metrics-driven, so just trying to maximize daily active hours, time spent on site, those types of things. And all the machine learning algorithms were really just based on embeddings. You have a user embedding and you have item embeddings, and you try to train the models to maximize the similarity of these. And it's basically a retrieval problem.

Okay, so you've been working on RAG for longer than most people think.

Well, kind of. So I worked there for about a year, transparently, and then I worked at Uber, where I was not working on ranking. It was more deep learning training for self-driving and computer vision and that type of stuff. But I think in the LLM world it's kind of a combination of everything these days. I mean, retrieval is not really LLMs, but it fits within the space of LLM apps. And then obviously having knowledge of the
underlying deep learning architectures helps, and having knowledge of basic software engineering principles helps too. And so I think it's kind of nice that this whole LLM space is basically just a combination of a bunch of stuff that people have probably done in the past.

It's good. It's like a summary capstone project.

Yeah, exactly.

And before we dive into LlamaIndex: what did they feed you at Robust Intelligence, that both you and Harrison from LangChain came out of it at the same time? Is there any fun story of how both of you came out with core infrastructure for LLM workflows today? Or how close were you at Robust? Any fun behind the scenes?

Yeah, we worked pretty closely. I mean, we were on the same team for about two years. I got to know Harrison and the team pretty well. I highly respect the people there. The people there were very driven, very passionate, and it definitely pushed me to be a better engineer and leader and those types of things. I don't really have a concrete explanation for this. I think it's more just that we had an LLM hackathon around September, this was just exploring GPT-3, or it was October, actually. And then the day after, I went on vacation for a week and a half, and so I just didn't check Slack or anything. I came back and saw that Harrison had started LangChain. I was like, oh, that's cool, I'll play around with that a bit. And then I hacked around on stuff. I think I've told the story a few times, but I was trying to feed information into GPT-3, and then you deal with context window limitations, and there was no tooling or really practices to try to understand how you get GPT-3 to navigate large amounts of data. And that's kind of how the project started. It really was just one of those things where, in the early days, we were just trying to build something that was interesting
and not really... I mean, I wanted to start a company, I had other ideas, actually, of what I wanted to start. I was very interested in, for instance, multimodal data, like video data and that type of stuff. And then this just kind of grew and eventually took over the other idea.

Text is the universal interface.

I think so. I actually think once the multimodal models come out, there are just mathematically nicer properties if you can get joint multimodal embeddings, CLIP-style. But text is really nice because, from a software engineering principle, it just makes things way more modular. You just convert everything into text and then you represent everything as text.

Yeah, I'm just explaining retroactively why working on LlamaIndex took off, versus if you had chosen to spend your time on multimodal, we probably wouldn't be talking about whatever you ended up working on.

Yeah, that's true.

So, interesting. November 9th, so that was a very productive month, I guess, October to November. On November 9th you announced GPT Tree Index, and you picked the tree logo. Very cool. Every project must have an emoji.

Yeah, that probably was somewhat inspired by LangChain, I will admit.

It uses GPT to build a knowledge tree in a bottom-up fashion by applying a summarization prompt for each node, which I like, that original vision. Your messaging around then was also that you're creating optimized data structures. What's the journey to that, and how does that contrast with LlamaIndex today?

Yeah, so maybe I can tell a little bit about the beginning intuitions. When I first started, this really wasn't supposed to be something that was a toolkit that people use. It was more just a system. And the way I wanted to think about the system was more as a thought
exercise of how language models, with their reasoning capabilities, if you just treat them as brains, can organize information and then traverse it. So I didn't want to think about embeddings. To me, embeddings just felt like an external thing, external to trying to actually tap into the capabilities of language models themselves. I really wanted to see, just as a human brain could synthesize stuff, could we create some sort of structure where this neural CPU, if you will, can organize a bunch of information, auto-summarize a bunch of stuff, and then also traverse the structure that I created. That was the inspiration for this initial tree index. To be honest, and I think I said this in the first tweet, it didn't actually work super well, right? Like GPT-3 at the time...

You were honest about that.

Yeah, I know. I mean, GPT-4 obviously is much better at reasoning. I'm one of the first to say you shouldn't use anything pre-GPT-4 for anything that requires complex reasoning, because it's just going to be unreliable, disregarding stuff like fine-tuning. But it worked okay. And I think it definitely struck a chord with the Twitter crowd, which was just looking for new ideas at the time, I guess, thinking about how you can actually bake this into some sort of application. Because I think what I also ended up discovering was that there was starting to be a wave of developers building on top of GPT-3, and people were starting to realize that what makes these models really useful is applying them on top of your personal data. And so even if the solution itself was kind of primitive at the time, the problem statement itself was very powerful. And so I think being motivated by the problem statement, this broad mission of how
do I unlock LLMs on top of the data, also contributed to the development of LlamaIndex to the state it is in today. And so I think part of the reason our toolkit has evolved beyond just the existing set of data structures is that we really tried to take a step back and think, okay, what exactly are the tools that would actually make this useful for a developer? And then somewhere around December we made an active effort to push in that direction: make the codebase more modular, more friendly as an open source library, and then also start adding in embeddings and start thinking about practical considerations like latency, cost, performance, those types of things. And then, really motivated by that mission, we started expanding the scope of the toolkit towards covering the life cycle of data ingestion and querying.

Yeah, where you also added LlamaHub.

Yeah, so I think that was in January, on the data loading side. And so we started adding some data loaders, saw an opportunity there, and started adding more stuff on the retrieval and querying side. We still had the core data structures, but how do you actually make them more modular and decouple storing state from the types of queries you could run on top of it a little bit? And then we started getting into more complex interactions, like chain-of-thought reasoning, routing, and agent loops.

Yeah, very cool. And then you and I spent a bunch of time earlier this year talking about LlamaHub and what that might become. You were still at Robust. When did you decide it was time to start the company, and then start to think about what LlamaIndex is today?

Probably December. It was clear that... it's kind of interesting, I was getting some inbound from initial VCs when I was talking about this project. And in the beginning I was like, oh yeah, this is just a side project, but
what about my other idea on video data? And I was trying to get their thoughts on that, and everybody was just like, ah, yeah, whatever, that part's a crowded market. And then it became clear that this was actually a pretty big opportunity. And coincidentally, this actually did relate to my interests, which have always been at the intersection of AI, data, and building practical applications. And it was clear that this was evolving into something much bigger than the previous idea was. So around December, and then I think I gave a pretty long notice, but I left officially in early March.

What was your thinking in terms of moats? Founders kind of overthink it sometimes. You obviously had a lot of open source love and a lot of community. Were you ever thinking, okay, I don't know, this is maybe not enough to start a company? Or did you always have conviction about it?

Oh no, I mean, 100%. I did this exercise, honestly, probably more in late December and early January, because I was just existentially worried about whether or not this would actually be a company at all. And okay, what were the key questions I was thinking about? These were the same things that other founders, investors, and also friends would ask me: okay, what happens if context windows get much bigger? What's the point of actually structuring data in the right way? Why don't you just dump everything into the prompt? Fine-tuning: what if you just train the model over this data? Then what's the point of doing this stuff? And then some other ideas: what if OpenAI actually just builds upwards on top of their existing foundation models and starts building in some built-in orchestration capabilities around stuff like RAG and agents and
those types of things? And so I basically ran through this mental exercise, and I'm happy to talk a little bit more about those thoughts as well. But at a high level, context windows have gotten bigger, but there's obviously still a need for RAG. I think RAG is just one of those things where, in general, yes, people do care about performance, but they also care about stuff like latency and cost. And my entire reasoning at the time was, okay, yes, maybe we'll have much bigger context windows, as we've seen with 100K context windows. But for enterprises, data is not just at the scale of a few documents. It's usually gigabytes, terabytes, petabytes. How do you actually unlock language models over that data? And so it was clear that, whether it's RAG or some other paradigm, no one really knew what that answer was. So there was clearly a technical opportunity here. There was a stack that needed to be invented to actually solve this type of problem, because language models themselves didn't have access to this data. And if you just dumped all this data in, let's say a model had hypothetically an infinite context window, and you just dumped 50 gigabytes of data into the context window, that just seemed very inefficient to me, because you have these network transfer costs of uploading 50 gigabytes of data to get back a single response. And so I kind of realized there's always going to be some curve, regardless of the performance of the best-performing models, of cost versus performance. And what RAG does is provide extra data points along that axis, because you control the amount of context you actually want to retrieve. And of course RAG as a term was still evolving back then, but it was just this whole idea of how do you fetch a bunch of information
to actually stuff into the prompt. And so people even back then were kind of thinking about some of those considerations.

And then you fundraised in June, or well, you announced your fundraise in June, with Greylock. How was that process? Just take us through that process of thinking about the fundraise and your plans for the company at the time.

Yeah, definitely. I mean, obviously we knew we wanted to fundraise. There was also a bunch of investor interest, and it was probably pretty unusual given the hype wave of generative AI. So a lot of investors were reaching out around December, January, February. In the end we went with Greylock. Greylock's great. They've been great partners so far. And to be honest, there are a lot of great VCs out there, and a lot of them specialize in open source, data, infra, and that type of stuff. What we really wanted to do, because for us time was of the essence, we wanted to ship very quickly and still build mindshare in this space, was to keep the fundraising process very efficient. I think we basically did it in a week, or three days. So yeah, we just front-loaded it and then picked...

The one named Jerry.

Yeah, exactly.

I'm kidding. I mean, he's obviously great, and Greylock's a fantastic firm.

Yeah, [unclear]. So yeah, we picked Greylock. They've been great partners. I think in general, when I talk to founders about the fundraise process, it's never the most fun period, because there's always a lot of logistics, there are lawyers you have to get in the loop, and a lot of founders just want to go back to building. And so I think in the end we're happy that we kept it a pretty efficient
process.

Cool. And so you fundraised with Simon, your co-founder. How do you split things with him? How big is your team now?

The team is growing. By the time this podcast is released, we'll probably have had one more person join the team. Basically, we're rapidly getting to eight or nine people. At the current moment we're around six. There will be some exciting developments in the next few weeks, and I'm excited to announce them. We've been pretty selective in terms of how we grow the team. Obviously we look for people that are really active in terms of contributions to LlamaIndex, people that have very strong engineering backgrounds, and primarily we've been looking for builders, people that will grow the open source and also eventually this managed enterprise platform with us. In terms of Simon, I've known Simon for a few years now. I knew him back at Uber ATG in Toronto. He's one of the smartest people I know. He has a deep understanding of ML, but also just first-principles thinking about engineering and technical concepts in general. And I think one of my criteria when I was looking for a co-founder for this project was someone that was technically better than me, because I knew I wanted a CTO. And so honestly, I know a lot of people that are smarter than me, but there weren't a lot of people that fit that bill, were willing to do a startup, and also just had the same values that I have. I think doing a startup is very hard work. I'm sure you guys all know this. It's a lot of hours, a lot of late nights, and you want to be in the same place together, being willing to hash out stuff and have that grit, basically. And I really looked for that, and
so Simon really fit that bill, and I think I convinced him to jump on board.

Yeah, nice job. And obviously I've had the pleasure of chatting and working a little bit with both of you. What would you say your top one or two values are when thinking about that, or the culture of the company and that kind of stuff?

Yeah, well, I think in terms of the culture of the company, there are a few things I can say off the top of my head. One is just passion and integrity. I think that's very important for us. We want to be honest. We obviously don't want to copy code or not give attribution, those types of things, and just be true to ourselves. I think we're all very down-to-earth, humble people, but obviously there's a willingness to just own stuff and dive right in, and I think grit comes with that. In the end, this is a very fast-moving space, and we want to be one of the dominant forces in helping to provide production-quality LLM applications.

Yeah. So I promise we'll get to more technical questions soon, but I also want to impress on the audience that this is very conscious and intentional company building. Since your fundraising post, which was in June, and now it's September, so it's been about three months, you've actually gained 50% in terms of stars and followers, you've 3x'd your download count to 600,000 a month, and your Discord membership has reached 10,000. So a lot of ongoing growth.

Yeah, definitely. And obviously there's a lot of room to expand there too. Open source growth is going to continue to be one of our core goals, because in the end we want this thing to be, well, one, big, right? We all have big ambitions. But two, to really provide value to developers in helping them with prototyping and also productionizing.
It's a fortunate circumstance: a lot of different companies and individuals are in that phase where maybe they've hacked around on some initial LLM applications, but they're also looking to start thinking about what the production-grade challenges are that they need to solve to actually make this thing robust and reliable in the real world. And so we want to basically provide the tooling to do that. And to do that, we need to spread awareness and education of a lot of the key practices of what's going on. So a lot of this is going to be continued growth, expansion, and education, and we do prioritize that very heavily.

Awesome. Let's dive into some of the questions you were asking yourself initially around fine-tuning and RAG and how these things play together. You mentioned context. What is the minimum viable context for RAG? So what's a context window that's too small, and at the same time, what's a maximum context window? We talked before about how LLMs are U-shaped reasoners: as the context gets larger, the model really only focuses on the end and the start of the prompt, and then it kind of peters out. Any learnings, any tips you want to give people as they think about it?

So this is a great question. Part of what I wanted to talk about at a conceptual level, especially with the idea of thinking about what is a minimum context, is: okay, what if the minimum context was 10 tokens, versus 2K tokens, versus a million tokens? What does that really give you, and what are the limitations? If it's 10 tokens, it's kind of like 8-bit or 16-bit games. Back in the day, if you played Mario, you had the initial Mario where the graphics were very blocky, and now obviously it's full HD 3D. The resolution of the context and the output will change depending on how much context you can actually fit
in. The way I think about this in a more principled manner is that there's this concept of information capacity, this idea of entropy: given any fixed amount of storage space, how much information can you actually compact in there? Basically, a context window length is just some fixed amount of storage space, and so there's some theoretical limit to the maximum amount of information you can compact into, say, a 4,000-token storage space. And what is that storage space used for these days with LLMs? It's for inputs and also outputs. So this really controls the maximum amount of information you can feed in via the prompt, plus the granularity of the output. If you had an infinite context window, you could have an infinitely detailed response and also infinitely detailed memory. But if you don't, you can only represent stuff in more quantized bits. And so the smaller the context window, generally speaking, the less detail and maybe the less specific, precise information you're going to be able to surface at any given point in time.

And when you have a short context, is the answer just to get a better model? Or is the answer maybe, hey, there needs to be a balance between fine-tuning and RAG, to make sure you're going to leverage the context but at the same time not keep it too low-resolution?

Yeah, well, there's probably some minimum threshold. I don't think anyone wants to work with a 10-token context window. I mean, that's just a thought exercise anyway. I think nowadays a modern context window of 2K or 4K is enough for doing some sort of retrieval on granular context and being able to synthesize information. For most intents and purposes, that level of resolution is probably fine for most people and most use cases. I think the limitation is actually more that, if you're going to combine this thing with some sort of retrieval data
structure mechanism, there are just limitations on the retrieval side, because maybe you're not actually fetching the most relevant context to answer the question. Given the right context, 4,000 tokens is enough, but if you're just doing top-k similarity, you might not be fetching the right information from the documents.

Yeah. So how should people think about when to stick with RAG versus when to even entertain fine-tuning? And also, what's the threshold of data at which you need to actually worry about fine-tuning versus just sticking with RAG? Obviously you're biased because you're building a RAG company.

No, actually, I think I have a few hot takes in here, some of which sound a little bit contradictory to what we're actually building. To be honest, I don't think anyone knows the right answer.

I think this is seeking the truth.

Yeah, exactly. This is just a thought exercise towards understanding the truth. So okay, I have a few hot takes. One is that RAG is basically just a hack. It turns out it's a very good hack. Because what is RAG? RAG is where you keep the model fixed and you just figure out a good way to stuff stuff into the prompt of the language model. Everything that we're doing nowadays in terms of stuffing stuff into the prompt is just algorithmic. We're just figuring out nice algorithms to retrieve the right information with top-k similarity, do some sort of hybrid search, some sort of chain-of-thought decomposition, and then just stuff stuff into the prompt. So it's all algorithmic, and it's more just software engineering to try to make the most of these existing APIs. The reason I say it's a hack is that, from a pure optimization standpoint, if you think about this through the machine learning lens versus the software engineering lens, there are pieces in here that are going to be suboptimal. Like, obviously,
the thing about machine learning is that when you optimize some system that can be optimized within machine learning, like a set of parameters, you're really changing the entire system's weights to try to optimize an objective function. If you just cobble a bunch of stuff together, you can't really optimize the pieces that are inefficient. A retrieval interface doing top-k embedding lookup is inefficient, because there might be a better, more learned retrieval algorithm. And I know nowadays there's this concept of how you do short-term or long-term memory: represent stuff in some sort of vector embedding, pick chunk sizes, all that stuff. Those are all just decisions you make that aren't really optimized. They're not automatically learned; they're things you set beforehand to feed into the system. There's a lot of room to optimize the performance of an entire LLM system in a more machine-learning-based way, and I will leave room for that. This is also why I think that in the long term fine-tuning will probably have greater importance, and there will probably be new architectures invented where you can include a lot of this under the black box, as opposed to cobbling together a bunch of components outside the black box. That said, just very practically, with the current state of things, even if I said RAG is a hack, it's a very good hack, and it's also very easy to use. For the AI engineer persona (which, to be fair, is one of the reasons generative AI has gotten so big: it's way more accessible for everybody to get into than traditional machine learning), it
tends to be good enough. If we can provide these existing techniques to help people optimize how they use existing systems without having to deeply understand machine learning, I still think that's a huge value add. So there's very much a UX and ease-of-use problem here: RAG is way easier to onboard and use, and that's probably the primary reason why everyone should do RAG instead of fine-tuning to begin with. If you think about the 80/20 rule, RAG very much fits within that, and fine-tuning doesn't really right now. And then I'm just leaving room for the future: in the end, fine-tuning can probably take over some aspects of what RAG does.

I don't know if this is mentioned in your recap there, but explainability also allows for sourcing. At the end of the day, to increase trust, we have to source documents.

Yeah. So I think what RAG does is increase transparency, visibility into the actual documents that are getting fed into the context.

Here's where they got it from.

Exactly. So that's definitely an advantage. The other piece that I think is an advantage, and it's something someone actually brought up, is that you can do access control with RAG if you have an external storage system. You can't really do that with large language models; you can't gate information in the neural net weights depending on the type of user.

For the first point, you could technically have the language model, if it memorized enough information, just cite sources. But there's a question of trust, whether or not you... Yeah, well, it makes it up right now because it's not good enough, but imagine a world where it is good enough and it does give accurate citations.

Yeah, no, I think to establish trust you just need a
direct connection. So it's kind of weird: it's this melding of deep learning systems with very traditional information retrieval.

Yeah, exactly. I kind of think about it as analogous to humans. We as humans obviously use the internet, we use tools. These tools have API interfaces that are well-defined, and obviously the tools aren't part of us, so we're not backpropping or optimizing over them. So when you think about RAG, it's basically the LLM learning how to use a vector database to look up information that it doesn't know. Then there's just a question of how much information is inherent within the network itself and how much it needs to do some sort of tool use to look up stuff it doesn't know. I do think there'll probably be more and more of that interplay as time goes on.

Yeah. Some follow-ups on discussions that we've had. We discussed fine-tuning a bit. What's your current take on whether you can fine-tune new knowledge into LLMs?

That's one of those things where I think long term you definitely can. Some people say you can't; I disagree. I think you definitely can. It's just that right now I haven't gotten it to work yet. So I've tried, yeah, well, not in a very principled way. This is something that requires an actual research scientist, and not someone who has an hour or two per night to...

You were a research scientist at Uber.

I mean, yeah, but that's full-time. So what I specifically, concretely did was I took the OpenAI fine-tuning endpoints. It's in a chat message interface, so there's a user/assistant message format, and what I did was try to take some piece of text and have the model memorize it by asking it a bunch of questions about the text. So given a bunch of
context, I would generate some questions and then generate some responses, and just fine-tune over the question-response pairs. That hasn't really worked super well, but that's also because I'm just trying to use the OpenAI endpoints as is. If you think about how you traditionally train a transformer model, there's the instruction fine-tuning aspect, where you ask it stuff and guide it with correct responses, but there's also just next-token prediction. That's something you can't really do with the OpenAI API, but you can do if you train it yourself, and it's probably possible if you train it over some corpus of data. I think Shishir from Berkeley said that when they trained Gorilla, they found a lot of these LLMs are actually pretty good at memorizing information; it's just that, the way the API interface is exposed, no one knows how to use them right now. So I think that's probably one of the issues.

Just to clue in people who haven't read the paper: Gorilla is the one where they trained a model to use specific APIs.

Yeah. And I think they also did something where, and I think this was in the Gorilla paper, the model itself could try to learn some prior over the data to decide what tool to pick, but it's also augmented with retrieval that helps supplement it in case the prior doesn't actually work.

Is that something you'd be interested in supporting?

I mean, I think in the long term, if this is how fine-tuning and RAG evolve, I do think there will be some aspect where fine-tuning will memorize some high-level concepts of knowledge, and then RAG will be there to supplement the aspects it doesn't know. To be clear, RAG right now is
the default way to actually augment stuff with knowledge. I think it's just an open question how much the LLM can actually internalize, both high-level concepts and details, as you train stuff over it. And coming from an ML background, there is a certain beauty in just baking everything into some training process of the language model. If you take raw ChatGPT or ChatGPT Code Interpreter, GPT-4, it's not like you do RAG with it. You just ask it questions like, "Hey, how do I define a Pydantic model in Python?" and then, "Can you give me an example? Can you visualize a graph for me?" and it just does it. We run it through Code Interpreter as a tool, but that's not a source of knowledge; that's just an execution environment. So there is some beauty in having the model itself, instead of you defining the algorithm for what the data structure should look like, just learn it under the hood. That said, I think the reason it's not a thing right now is that no one knows how to do it, it probably costs too much money, and the API interfaces and the actual ability to evaluate and improve on performance aren't known to most people.

Yeah. It also would be better with browsing.

Yeah, I wonder when they're going to put that back.

Okay, cool. And then one more follow-up before we go into RAG for AI engineers. You briefly mentioned security or auth. You talk to a lot of people putting LlamaIndex into production. How many of them are actually doing that, versus "let's just dump a whole company Notion into this thing"?

Wait, are you talking about it from the security and auth standpoint?

Yeah. How big a need is that? Because I talked to some people who are thinking about building tools in that
domain, but I don't know if people want it.

I mean, I think bigger companies, like banks and consulting firms, all want this requirement. The way they're using LlamaIndex is not with this, obviously, because I don't think we have support for access control or auth, that type of stuff, under the hood, since we're more just an orchestration framework. So the way they build these initial apps is more prototype-like: let's use some publicly available data that's not super sensitive, let's assume that every user is going to have access to the same amount of knowledge, those types of things. I think users have asked for it, but I don't think that's a P0. I think the P0 is more, can we get this thing working before we expand it to more users within the org?

Yep, cool. So there's a bunch of pieces to RAG; obviously it's not just an acronym. And you said recently that you think every AI engineer should build it from scratch at least once. Why is that?

I think so. I'm actually kind of curious to hear your thoughts about this, but this relates to the initial AI engineering post that you put out, and also just the role of an AI engineer and the skills they're going to have to learn to truly succeed. Because there's an entire spectrum. On one end, you have people who don't really understand the fundamentals and just want to cobble something together to build something. I think there is a beauty in that, for what it's worth. Gen AI has made it so that you can just use these models in inference-only mode, cobble something together, and use it to power your app experiences. On the other end, what we're increasingly seeing is that more and more developers building these apps start running into, honestly, pretty similar issues
that plague a standard ML engineer building a classifier model, which is just accuracy problems. Hallucination is basically just an accuracy problem, right? It's not giving you the right results. So what do you do? You have to iterate on the model itself, you have to figure out what parameters to tweak, you have to gain some intuition about the entire process. That workflow is honestly pretty similar, even if you're not training the model, to tuning an ML model with hyperparameters and learning proper ML practices: how do I define a good evaluation benchmark? How do I define the right set of metrics to use? How do I actually iterate and improve the performance of this pipeline for production? What tools do I use? Every ML engineer uses some form of Weights & Biases, TensorBoard, or some other experiment tracking tool. What tools should I use to help build LLM applications and optimize them for production? There's a certain amount of LLMOps tooling, concepts, and practices that people have to internalize if they want to optimize these. So the reason I think being able to build RAG from scratch is important is that it really gives you a sense of how things are working, to help you build intuition about what parameters there are within a RAG system and which ones to tweak to make it better. One of the advantages of the LlamaIndex quick start is that it's three lines of code. The downside of that is you have zero visibility into what's actually going on under the hood. This is something we've been thinking about for a while, and I'm like, okay, let's just release a new tutorial series that, instead of three lines of code, goes in and actually shows you how the thing works under the hood, right?
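
[Editor's sketch] To make the "build RAG from scratch" point concrete, here is a minimal retrieval pipeline in plain Python that exposes the knobs mentioned in the conversation: chunk size, chunk overlap, and top-k. It is not LlamaIndex code. Every function name here (chunk_text, embed, cosine, retrieve, build_prompt) is hypothetical, and the word-count "embedding" is a deliberately crude stand-in for a learned embedding model.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into windows of chunk_size words that share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding (assumption): lowercase word counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks ranked by similarity to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(query, context_chunks):
    """Stuff the retrieved chunks into a prompt for a fixed language model."""
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    chunks = [
        "LlamaHub is a collection of data loaders",
        "Retrieval picks chunks similar to the query",
        "Synthesis feeds chunks to the language model",
    ]
    question = "what is llamahub"
    print(build_prompt(question, retrieve(question, chunks, top_k=1)))
```

Changing chunk_size, overlap, top_k, or the embed function changes which context reaches the model, which is the kind of parameter intuition the from-scratch exercise is meant to build.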
And so, does everybody need this? Probably not; for some people the three lines of code might work. But increasingly, honestly, 90% of the users I talk to have questions about how to improve the performance of their app. So given that, this is just one of those things that's better for understanding.

Yep. I'd say it is one of the most useful tools in any sort of developer education toolkit: to write things for yourself from scratch. Kelsey Hightower famously wrote Kubernetes the Hard Way, which is: don't use Kubernetes, just do everything; here's everything that you would have to do by yourself, and you should be able to put all these things together yourself to understand the value of Kubernetes. And the same thing for LlamaIndex. I was the guy who did the same for React. It's a pretty good exercise for you to fully understand everything that's going on under the hood. And I was actually going to suggest, from one of the previous conversations: there are all these hyperparameters, like the size of the chunks and all that, and I was always thinking, what would hyperparameter optimization for RAG look like?

Yeah, definitely. Absolutely, I think that's going to be an increasing thing, and it's something we're kind of looking at, because I think someone should just do some large-scale study and ablate everything.

You tell us.

I think it's going to be hard to find a universal default that works for everybody. I do think it's going to be somewhat dependent on the data and the use case. If there was a universal default, that would be amazing. But increasingly we've found that people are just defining their own custom parsers for PDFs, markdown files, SEC filings versus, you know,
Slack conversations. And then the use case too: do you want summarization, what's the granularity of the response? It really affects the parameters you want to pick. So I do like the idea of hyperparameter optimization, but it's one of those things where you are basically training the model on your own data domain.

Yeah. You mentioned custom parsers. Maybe we can talk about the surface area of the framework. You designed LlamaIndex in a way that it's more modular, like you mentioned. How would you describe the different components and what's customizable in each?

Yeah, I think they're all customizable, and I think there is a certain burden on us to make that more clear through the docs.

Well, number four is a customization tutorial, so...

Yeah, yeah. But just in general, we do try to make it so that you can plug in the out-of-the-box stuff, but if you want to customize lower-level components, we definitely encourage you to do that and plug it into the rest of our abstractions. So let me walk through some of the basic components of LlamaIndex. There are data loaders: you can load data from different data sources. We have LlamaHub, which you guys brought up, which is a collection of different data loaders for unstructured data, like PDFs and other file types, Slack, Notion, all that stuff. Once you load in this data, we have a bunch of parsers and transformers: you can split the text, you can add metadata to the text, and then basically figure out a way to load it into a vector store. I mean, you worked at Airbyte, right? There's some aspect of ETL, in terms of transforming this data and then loading it into some storage abstraction. We have a bunch of integrations with different
document storage systems. So that's data. The second piece is really about how you retrieve this data, how you synthesize it, and how you do some sort of higher-level reasoning over it. Retrieval is one of the core abstractions that we have. We do encourage people to customize and define their own retrievers; that's why we had that section on how to define your own custom retriever. But we also have out-of-the-box ones. The retrieval algorithm depends on how you structure the data, obviously. If you just flat-index everything as chunks with embeddings, then you can really only do top-k lookup, plus maybe keyword search or something. But if you index it in some sort of hierarchy and define relationships, you can do more interesting things, like actually traverse relationships between nodes. Then, after you have this data, how do you synthesize it? This is the part where you feed it into the language model. There's a response abstraction that can abstract away long context and still give you a response even if the context overflows the context window. And then there are these higher-level reasoning primitives that I'm going to define broadly and just put in some general bucket of agents, even though everybody has different definitions of agents.

And you're the first to do data agents, which I was very excited about.

Yeah, we kind of coined that term. The way we thought about it was that we wanted to think about how to use agents for data workflows, basically. So what are the reasoning primitives that you want? The simplest reasoning primitive you can do is some sort of routing module. It's a classifier: given a query, just make some automated decision on what choice to
pick. You could use LLMs; you don't have to use LLMs. You could just train a classifier, basically. That's something we might actually explore. The next piece is, okay, what are some higher-level things? You can have the LLM define a query plan to execute over the data. You can do some sort of while loop, which is basically what an agent loop is: ReAct, Tree of Thoughts, chain of thought, the OpenAI function-calling while loop, to take a question and break it down into some series of steps to execute to get back a response. So there's a range in complexity from simple reasoning primitives to more advanced ones. And the way we think about it is: which ones should we implement, and do they work well over the types of data tasks that we give them?

How do you think about optimizing each piece? Take embedding models as one piece of it. You offer fine-tuning of embedding models, and I saw that fine-tuning gives you something like a 5-10% increase. What's the delta left on the embedding side? Do you think we can get models that are a lot better? Do you think that's one piece where people should really not spend too much time?

I mean, I think they should. I just think it's not the only parameter. In the end, if you think about everything that goes into retrieval: there's the chunking algorithm; how you define metadata, which will bias your embedding representations; then there's the actual embedding model itself, which is something you can try optimizing; and then there's the retrieval algorithm. Are you going to just do top-k, or hybrid search, or auto-retrieval? There's a bunch of parameters. So I do think it's something everybody should try. By
default we use the OpenAI embedding model. A lot of people these days use Sentence Transformers because it's free and open source, and you can directly optimize it. This is an active area of exploration. I do think one of our goals is that it should ideally be relatively free for every developer to run some fine-tuning process over their data to squeeze out some more points in performance, and if it's that cheap and there are no downsides, everybody should basically do it. There are just some complexities in optimizing your embedding model, especially in a production-grade data pipeline. If you actually fine-tune the embedding model and the embedding space changes, you're going to have to reindex all your documents, and for a lot of people that's not feasible. So I think Jo from Vespa, on our webinar, raised this idea that if you're using document and query embeddings, you could keep the document embeddings frozen and just train a linear transform, or any sort of transform, on the query. So it's just a query-side transformation instead of having to reindex all the document embeddings. The other piece is...

Wow, that's pretty smart.

Yeah. We weren't able to get huge performance gains there, but it does improve performance a little bit, and that's something that basically everybody should be able to kick off. You can actually do that in LlamaIndex too.

OpenAI has a cookbook on adding bias to the embeddings too, right?

Yeah, I think so. There are just different parameters you can try adding to optimize the retrieval process. And the idea is, okay, by default you have all this text, and it kind of lives in some latent space, right?

Shout out, shout out, Latent Space. You should take a drink every time.

But it lives in some latent space,
but depending on the specific types of questions the user might want to ask, the latent space might not be optimized to actually retrieve the relevant piece of context for what the user wants to ask. So can you shift the embedding points a little bit, and how do we do that? That's really the key question here. Optimizing the embedding model, even changing the way you chunk things, these all shift the embeddings.

So retrieval is interesting. I got a bunch of startup pitches that are like, "Look, RAG is cool, but there's a lot of stuff in terms of ranking that could be better; there's a lot of stuff in terms of sunsetting data once it starts to become stale that could be better." Are you going to move into that part too? You have SEC Insights as one of your demos, and that's a great example of, hey, I don't want to embed all the historical documents, because a lot of them are outdated and I don't want them in the context. What's that problem space like? How much of it are you going to help with, versus how much do you expect others to take care of?

Yeah, I'm happy to talk about SEC Insights in just a bit. More broadly, about the overall retrieval space: we're very interested in it, because a lot of these are very practical problems that people have asked us about. So, the idea of outdated data: how do you deprecate or time-weight data, and do that in a reliable manner, so that you don't just set some parameter and all of a sudden it affects all your retrieval algorithms? That's pretty important, because people have started bringing it up: "I have a bunch of duplicate documents, things get out of date, how do I sunset documents?" And then ranking, right? So I think this space is not new. Rather than inventing new retriever techniques for the sake of just inventing better ranking, we
want to take existing ranking techniques and package them in a way that's intuitive and easy for people to understand. That said, I think there are interesting new retrieval techniques that can be done when you tie retrieval into some downstream RAG system. The reason is, if you think about the idea of chunking text, that just really wasn't a thing before, or at least not for this specific purpose. The reason chunking is a thing in RAG right now is that you want to fit within the context window of an LLM. Why would you want to chunk a document? That was less of a thing back then; if you wanted to transform a document, it was more for structured data extraction or something, in the past. So there are certain new concepts you get to play with, that you can use to invent more interesting retrieval techniques. Another example here is LLM-based reasoning, like LLM-based chain-of-thought reasoning. You can take a question, break it down into smaller components, and send those to your retrieval system, and that gives you better results than sending the full question to a retrieval system. That also wasn't really a thing back then. But then you can figure out an interesting way of blending the old and the new, with LLMs and data.

Yeah, there are a lot of ideas that you come across. Do you have a store of them?

So, okay, I think sometimes I get inspiration; there's some problem statement and...

Following you is very hard because it's just a lot of homework.

So I think I've started to step on the brakes just a little bit, because...

Keep going!

No, no. Well, the reason is, okay, if I just invent a hundred more retrieval techniques, sure, but how do people know
which one is good and which one's bad?

You have to have a librarian, right? Someone's going to catalog it.

You're going to need some benchmarks. So I think that's probably the focus for the next few weeks: properly having an understanding of, oh, when should you do this, or when does this actually work well?

Yeah, some kind of flowchart or decision tree type of thing.

Yeah, exactly.

When this, do that.

Yeah, something like that.

That would be really helpful for me, thank you. Do you want to talk about SEC Insights?

Sure, yeah. You had a question?

Yeah, just, I mean, it seems like your most successful side project.

Yeah, okay.

So what is SEC Insights, for our listeners?

SEC Insights is a full-stack LLM chatbot application that does analysis over SEC 10-K and 10-Q filings. The goal for building this project was really twofold. The reason we started building it was, one, it was a great way to dogfood the production readiness of our library. We actually ended up adding a bunch of stuff and fixing a ton of bugs because of this, and I think it was great, because thinking about how we handle callbacks, streaming, actually generating reliable sub-responses and bubbling up sources and citations: these are all things that, if you're just building the library in isolation, you don't really think about, but if you're trying to tie this into a downstream application, it really starts mattering.

Is this for your error messages? You talk about bubbling up stuff for observability...

Like sources. If you go into SEC Insights and you type something, you can actually see the highlights on the right side. That was something that took a little bit of understanding to figure out how to build well. So it was great for dogfooding improvement
of the library itself. And then, as we were building the app, the second thing was that we were starting to talk to users and trying to showcase to bigger companies the potential of LlamaIndex as a framework. Because these days, obviously, building a chatbot with Streamlit or something will take you 30 minutes or an hour. There are plenty of templates out there on LlamaIndex and LangChain; you can just build a chatbot. But how do you build something that satisfies criteria like surfacing citations, being transparent, having a good UX, and also being able to handle different types of questions, like more complex questions that compare different documents? That's something I think people are still trying to explore. So what we did was, first, show organizations the possibilities of what you can do when you actually build something like this. Then we kind of stealth-launched it for fun, just as a separate project, to see if we could get feedback from users on how we could improve stuff. And then we thought, you know, we built this, and obviously we're not going to sell a financial app, that's not really in our wheelhouse, so we're just going to open source the entire thing. So now it's basically a really nice full-stack app template that you can use and customize on your own to build your own chatbot, whether it's over financial documents or other types of documents. It provides a nice template for basically anybody to go in and get started. There are certain components there that aren't released yet, that we're going to release in the next few weeks. One is more detailed guides on the different modular components within
it, so if you're a full-stack developer you can go in, take the pieces you want, and build your own custom flows. The second piece is that there are certain components in there that might not be directly related to the LLM app that it would be nice to just have people use. An example is the PDF viewer, the PDF viewer with citations. I think we're just going to give that away. So you could be using any library you want, but then you can just drop in a PDF viewer. It's a fun little module that you can plug in.

Nice. Yeah, that's really good community service right there. Well, I want to talk a little bit about your cloud offering, because you mentioned... I forget the name that you had for it. Enterprise something?

Well, one, we haven't come up with the name. We're kind of calling it LlamaIndex Platform, or LlamaIndex Enterprise. I'm open to suggestions here. I think the high level of what I can probably say is that we're looking at ways of actively complementing the developer experience of building with LlamaIndex. We've always been very focused on stuff around plugging your data into the language model, so can we build tools that help augment that experience beyond the open source library? What we're going to do is build an experience where it's very seamless to transition from the open source library: with a one-line toggle you can basically get this complementary service. And then we'll figure out a way to monetize in a bit. Revenue focus this year is less emphasized; it's more just about, can we build some managed offering that provides complementary value to what the open source library provides?

Yeah. I think it's the classic thing about all open source:
you want to start building the most popular open source project in your category to own the category. You're going to make it very easy to host, and then you've just built your biggest competitor, which is you.

Yeah, it'll be fine. I think it'll be complementary, because you use the open source library, then you have a toggle, and all of a sudden you can see this, basically like a pipeline-ish thing, pop up. You'll have a UI, there'll be some enterprise guarantees, and the end goal would be to help you build production RAG more easily.

Yeah, great, awesome. Should we go on to ecosystem and stuff?

Go ahead.

Data loaders: there are a lot of them. What are maybe some of the most popular, maybe not underrated but unexpected? And how has the open source side of it helped with getting a lot more connectors? You only have six people on the team today, so you couldn't have done it all yourself, I'm sure.

Sure. Yeah, I think the nice thing about LlamaHub itself is that it's supposed to be a community-driven hub, and so actually the bulk of the PRs are completely community contributed. We haven't written that many first-party connectors for this; it's more just encouraging people to contribute to the community. In terms of the most popular tools, or data loaders, I think we have Google Analytics on this and I forgot the specifics. It's some mix of the PDF loaders, we have like 10 of them, but there's some subset of them that are popular, and then there's Google, like I think Gmail, G Drive, and then I think maybe it's one of Slack or Notion. One thing I will say, though, and I think swyx probably knows this better than I do given that he used to work at Airbyte, is that it's very hard to build, especially for a full-on service
like Notion, Slack, or Salesforce, a really, really high-quality loader that extracts all the information that people want. So I think the thing is, when people start out, they will probably use these loaders, and it's a great tool to get started. For a lot of people it's good enough, and they submit PRs if they want additional features. But if you get to a point where you actually want to call an API that hasn't been supported yet, or you want to load in stuff like metadata or something that hasn't been directly baked into the logic of the loader itself, people end up writing their own custom loaders. That's a thing that we're seeing, and that's something that we're okay with, because a lot of this is more just community driven. If you want to submit a PR to improve the existing one, you can; otherwise you can create your own custom ones.

Yeah. And is all that custom loader work supported within LlamaIndex, or do you pair it with something else?

Oh, it's just, I mean, you just define your own subclass. I think that's it.

Yeah, because typically in the data ecosystem with Airbyte, you know, Airbyte has its own strategy with custom loaders, but also you could write your own with Dagster or Prefect or one of those tools.

Yeah, exactly. So I think for us it's more that we just have a very flexible document abstraction that you can fill in with any content that you want.

Okay. Are people really dumping all their Gmail into these things? You said Gmail is number two.

Yeah, it's one of the Google, some Google product. I think it might be, yeah.

Wow.

I'm not sure, actually.

I mean, that's the most private data source.

That's true.

So I'm surprised that people don't mind. I mean, I'm sure some people are, but I'm surprised it's popular.

Yeah, let me
revisit the Google Analytics. I want to make sure I give you the accurate response.

Yeah. Well, and then the LLM engine. I assume OpenAI is going to be a majority. Is it an overwhelming majority? What's the market share between OpenAI, Cohere, Anthropic, whatever you're seeing, open source too?

OpenAI has a majority, but then there's Anthropic and there's also open source. I think there are a lot of people trying out Llama 2 and some variant of a top open source model.

Side note, any confusion there, Llama 2 versus Llama?

Yeah, I think whenever I go to these talks I always open it up with: we started before Meta, right? I want to point that out. But no problem; we try to use it for branding. We just add two llamas when we have a Llama 2 integration instead of one llama. Anyways, I think a lot of people are trying out the popular open source models, and these days there are a lot of toolkits and open source projects that allow you to self-host and deploy Llama 2. Ollama is just a very recent example that we added an integration with. So by virtue of having more of these services, I think more and more people are trying it out.

Yeah. Do you think there's potential there? Is there going to be an increasing trend toward open source?

Yeah, definitely. I think in general people hate monopolies, and so whenever OpenAI has something really cool, or any company has something really cool, even Meta, there's just going to be huge competitive pressure from other people to do something that's more open and better. So I do think market pressures will improve open source adoption.

Last thing I'll say about this, which is just that it gets clicks; people psychologically want that, but then at the end of the day
they fall for brand name and popularity and performance benchmarks, you know, and at the end of the day OpenAI still wins on that.

I think that's true. But I just think, unless you're an active employee at OpenAI, all these research labs are putting out ML PhDs, and other companies too are investing a lot of dollars, so there's going to be a lot of competitive pressure to develop better models. So is it going to be all fully open source with a permissive license? I'm not completely sure, but there's just a lot of incentive for people to develop better stuff here.

Have you looked at RAG-specific models, like Contextual?

No. Is it public?

No, they literally just, so Douwe Kiela, I think, is his name; you probably came across him. He wrote the RAG paper at Meta and just started Contextual AI to create a RAG-specific model. I don't know what that means. I was hoping that you do, because it's your business.

Yeah, if I had inside information. I mean, to be honest, I think this kind of relates to my previous point on RAG and fine-tuning. A RAG-specific model is a model architecture that's designed for better RAG, and it's less the software engineering principle of, how can I take existing stuff and just plug and play different components into it? There's a beauty in that from ease of use and modularity, but when you want to end-to-end optimize the thing, you might want a more specific model. I don't know. I think building your own models is honestly pretty hard. And I think the issue is, if you also build your own models, you're also just going to have to keep up with the rate of LLM advances. Basically the question is, when GPT-5 and 6 and whatever, like Anthropic Claude 3, comes out, how can you prove that you're actually better than a software developer just
piecing together components on top of a base model, right? Even if conceptually this is better than maybe GPT-3 or GPT-4.

Yeah, the base model game is expensive.

Yeah.

What about vector stores? I know swyx is wearing a Chroma sweatshirt.

They've got good swag game.

I have the mug from Chroma.

Great. What do you think there? There are a lot of them. Are they pretty interchangeable for your users' use case? Is HNSW all we need? Is there room for improvement there?

Is HNSW all we need? Yeah, I think we try to remain unopinionated about storage providers, so we don't try to play favorites. We have a bunch of integrations, obviously, and the way we try to do it is we just try to define some standard interfaces. But obviously different vector stores will support slightly additional things, like metadata filters and those things, and the goal is basically to leave it up to our users to figure out what makes sense for their use case. In terms of the algorithm itself, I don't think the delta on improving the vector store embedding lookup algorithm is that high. I think this stuff has been mostly solved, or at least there's just a lot of other stuff you can do to try to improve performance.

Like what?

No, I mean everything else that we just talked about, in terms of accuracy, right? To improve RAG: everything that we talked about, like chunking, like metadata.

Yeah, well, I mean, I was just thinking, maybe for me the interesting question is, it's kind of Game of Thrones, there's like the war of eight databases right now.

Oh, I see.

How do they stand out, and how do they become very good partners with LlamaIndex?

Oh, I mean, I think we're pretty good partners with most of them. Let's see.

Well, so if you're a vector database founder, what do
you work on?

It's a good question. I think one thing I'm very interested in, and this is something I've started to see a general trend towards, is combining structured data querying with unstructured data querying. I think that will probably expand the query sophistication of these vector stores and basically make it so that users don't have to think about whether they...

Would you call this hybrid querying? Is that what Weaviate's doing?

Yeah, I mean, if you think about metadata filters, that's basically a structured filter. It's like a SELECT * or a SELECT WHERE, right, something equals something, and then you combine that with semantic search. I think LanceDB or something is trying to do some joint interface. The reason is that most data is semi-structured: there are some structured annotations and there's some unstructured text. So somehow combining all the expressivity of SQL with the flexibility of semantic search is something that I think is going to be really important. We have some basic hacks right now that allow you to jointly query both a separate SQL database and a vector store to combine the information. That's obviously going to be less efficient than if you just combined it into one system.

Yeah.

And so I think pgvector, that type of stuff, is starting to get there. But in general, how do you have an expressive query language to actually do structured querying along with all the capabilities of semantic search?

So your current favorite is just put it into Postgres?

No, no, no. Not the Postgres language, the query language. I actually don't know what the best language would be for this, because I think it will be something that the model hasn't been fine-tuned over, so you might want to train the model over this. But some way of expressing structured
data filters, and this could include time too, right? It doesn't have to just be a WHERE clause, along with this idea of semantic search.

Yeah. And we talked about graph representations.

Oh yeah, that's another thing too. That's actually something I didn't even bring up yet. There's this interesting idea of, can you actually have the language model explore relationships within the data too, and somehow combine that information with stuff that's more structured within the DB?

Awesome. What else is left in the stack?

Oh, evals.

Yeah. What are your current strong beliefs about how to evaluate RAG?

I think I have thoughts. We're trying to curate this into some more opinionated principles, because there are some open questions here. One question I had to think about is whether you should do evals component by component first, or you should just do the end-to-end thing. I think you might actually just want to do the end-to-end thing first, just to do a sanity check of whether, given a query and the final response, it even makes sense. You eyeball it, right, and then you try to do some basic evals. And then once you diagnose what the issue is, you go into the specific area to find some more solid benchmarks and try to improve stuff. So what is end-to-end evals? You have a query, it goes through the retrieval system, you get back something, you synthesize a response, and that's your final thing, and you evaluate the quality of the final response. And these days there are plenty of projects, startups, companies, research, doing stuff around GPT-4 as a human judge, to basically synthetically generate...

Do they do well? I mean, I think it's too easy.

Well, I think... oh, you're talking about the startups. Yeah, I don't know. I don't know
from the startup side. I just know from the technical side, I think people are going to do more of it. The main issue right now is just that it's really unreliable; there's variance in the response.

Then they won't do more of it. I mean, it's bad.

But these models will get better, and you'll probably fine-tune the model to be a better judge. I think that's probably what's going to happen. So I'm reasonably bullish on this, because I don't think there's really a good alternative beyond just human-annotating a bunch of datasets and trying to manually go through and curate and evaluate eval metrics. So this is just going to be a more scalable solution. In terms of startups, yeah, I think there's a bunch of companies doing this. In the end it probably comes down to some aspect of UX, speed, and whether you can fine-tune a model. So that's end-to-end evals. And then I think what we found is, for RAG, a lot of times what ends up affecting this end response is retrieval: you're just not able to retrieve the right response. I think having proper retrieval benchmarks, especially if you want to do production RAG, is actually quite important. What does having good retrieval metrics tell you? It tells you that at least the retrieval is good. It doesn't necessarily guarantee the end generation is good, but at least it gives you some sort of sanity check, so you can fix one component while optimizing the rest. Retrieval evaluation is pretty standard and it's been around for a while. It's just an IR problem, basically. You have some input query, you get back some retrieved set of context, and then there's some ground truth in that ranked set, and you try to measure it based on ranking metrics. So the closer that ground truth is to the top, the more you reward the evals, and the closer it is to the bottom, or if it's not in the
retrieved set at all, then you penalize the evals. So that's just a classic ranking problem. Most people starting out probably don't know how to do this. Right now we've just launched some basic retrieval evaluation modules to help users do this. One piece is just curating this dataset in the first place, and one thing that we're very interested in is this idea of synthetic dataset generation for evals. So how can you, given some context, generate a set of questions with GPT-4, and then all of a sudden you have question and context pairs, and that becomes your ground truth.

Yeah. Are data agent evals the same thing, or is there a separate set of stuff for agents that you think is relevant here?

Data agents add another layer of complexity, because then you just have more loops in the system. You can evaluate each chain-of-thought loop itself, every LLM call, to see whether or not the input to that specific step in the chain-of-thought process actually works or is correct, or you could evaluate the final response to see if that's correct. This gets even more complicated when you do multi-agent stuff, because now you have some communication between different agents, like a top-level orchestration agent passing it on to some low-level stuff. I'm probably less familiar with agent eval frameworks. I know they're starting to become a thing. I was talking to Joon, from the Generative Agents paper, which is pretty unrelated to what we're doing now, but it's very interesting, where you can kind of evaluate overall agent simulations by understanding whether or not they modeled the distribution of human behavior. But that's a very macro principle, right? That's very much evaluating stuff by how well it models the distribution of things. And I think that works well when you're
trying to generate something for creative purposes. But for stuff where you really want the agent to achieve a certain task, it really is whether or not it achieved the task, right? Because then it's not, oh, does it generally mimic human behavior; it's, no, did you send this email or not? Because otherwise this thing didn't work.

Yeah, makes sense. Awesome. Let's jump into the lightning round. We have three questions: acceleration, exploration, and then one final takeaway. The acceleration question is: what's something that already happened in AI that you thought would take much longer to get here?

I think just the ability of LLMs to generate believable outputs, both for text and also for images. The whole reason I started hacking around with LLMs, honestly, I felt like I got into it pretty late. I should have got into it in early 2022, because GPT-3 had been out for a while. Just the fact that there was this engine that was capable of reasoning and no one was really tapping into it. And then the fact that, you know, I used to work in image generation for a while. I did GANs and stuff back in the day, and that was pretty hard to train. You would generate these 32x32 images, and now, taking a look at some of the stuff by DALL-E and Midjourney and those things, it's just very good.

Yeah. Exploration: what do you think is the most interesting unsolved question in AI?

Yeah, I'd probably work on some aspect of personalization of memory. I think a lot of people have thoughts about that, but for what it's worth, I don't think the final state will be RAG. I think it'll be some fancy algorithm or architecture where you bake it into the architecture of the model itself. Like, if you have a personalized assistant that you can talk to, that will learn behaviors over time, right,
and learn stuff through conversation history, what exactly is the right architecture there? I do think that'll be part of the continuous fine-tuning. Yeah, like some aspect of that, right? I don't actually know the specific technique, but I don't think it's just going to be something where you have a fixed vector store and that thing will be the thing that stores all your memories.

Yeah, it's interesting, because I feel like using model weights for memory, it's just such an unreliable storage device.

I know. But I just think, from the AGI perspective, you know, just modeling the human brain, there is something nice about being able to optimize that system, right? And to optimize a system you need parameters, and that's where you get into the neural net piece.

Cool, cool. And yeah, takeaway. You've got the audience's ear. What's something you want everyone to think about or take away from this conversation and your thinking?

I think there were a few key things. We talked about two of them already, one of which was SEC Insights, which, if you guys haven't checked it out, I would definitely encourage you to do so, because it's not just a random SEC app; it's a full-stack thing that we open sourced. So if you guys want to check it out, I would definitely do that. It provides a template for you to build production-grade RAG apps, and we're going to open source and modularize more components of that soon and do a workshop on it. And the second piece is that we are thinking a lot about retrieval and evals. Right now we're exploring integrations with a few different partners, and so hopefully some of that will be released soon. So just, how do you basically have an experience where you write LlamaIndex code and all of a sudden you can easily run retrievals, evals, and traces,
all that stuff, in like a service. I think we're working with a few providers on that. And then the other piece, which we did talk about already, is this idea of building RAG from scratch. I mean, I think everybody should do it. I would check out the guide if you guys haven't already; I think it's in our docs. Instead of just using either the retriever query engine in LlamaIndex or the conversational QA chain in LangChain, I would take a look at how you actually chunk and parse data and do top-k embedding retrieval, because I really think that by doing that process, it helps you understand the decisions, the prompts, the language models to use.

That's it. Thank you so much.

Thank you, Jerry.

Yeah, thank you.
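
The from-scratch exercise recommended at the end of the conversation (chunk the data, then do top-k embedding retrieval) can be sketched roughly as below. This is a minimal illustration, not LlamaIndex code: every function name here (`chunk_text`, `embed`, `retrieve_top_k`, `build_prompt`) is hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model, chosen only so the sketch runs with the standard library.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of words (a deliberately naive chunker)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """ASSUMPTION: toy 'embedding' (word counts); swap in a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Score every chunk against the query and keep the k most similar."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, retrieved: List[Tuple[float, str]]) -> str:
    """Stuff the retrieved chunks into a prompt for whichever LLM you call next."""
    context = "\n---\n".join(chunk for _, chunk in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    corpus = [
        "loaders pull data from Slack and Notion",
        "vector stores hold embeddings",
        "ranking metrics score retrieval",
    ]
    question = "which vector stores hold embeddings"
    print(build_prompt(question, retrieve_top_k(question, corpus, k=2)))
```

Writing these pieces by hand makes the decisions mentioned above visible: chunk size and overlap, the choice of embedding, the value of k, and the prompt template are all explicit parameters rather than framework defaults.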

The video discusses RAG and its application in LLMs, with a focus on LlamaIndex, a toolkit for building and optimizing RAG systems. The speaker covers topics such as fine-tuning, information retrieval, and evaluation, and provides insights into the development and application of RAG systems.

Key Takeaways

Build a RAG system using LlamaIndex
Fine-tune an LLM using OpenAI endpoints
Evaluate a RAG system using retrieval benchmarks
Implement a vector store for RAG
Craft effective prompts for RAG systems
Optimize prompt performance
Understand language model decisions
Design and implement a RAG system
Use LlamaIndex for building and optimizing RAG systems
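
For the retrieval-benchmark takeaway, the evaluation described in the conversation is a ranking problem: each question has a ground-truth context, the score is higher the closer that context is to the top of the retrieved list, and it is zero if the context is missing. A minimal sketch using hit rate and mean reciprocal rank follows. The names `reciprocal_rank` and `evaluate_retriever` are hypothetical, the question/context pairs are hand-written here (the conversation suggests generating them synthetically with GPT-4), and `fake_retrieve` is a stand-in for a real retriever.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], truth_id: str) -> float:
    """1/rank of the ground-truth item, or 0.0 if it was not retrieved at all."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == truth_id:
            return 1.0 / rank
    return 0.0


def evaluate_retriever(
    retrieve: Callable[[str], List[str]],
    eval_pairs: Sequence[Tuple[str, str]],
    k: int = 3,
) -> Dict[str, float]:
    """Average hit rate and reciprocal rank over (question, ground-truth id) pairs."""
    if not eval_pairs:
        raise ValueError("eval_pairs must not be empty")
    hits = 0
    rr_total = 0.0
    for question, truth_id in eval_pairs:
        ranked = retrieve(question)[:k]  # only the top-k results count
        rr = reciprocal_rank(ranked, truth_id)
        hits += 1 if rr > 0 else 0
        rr_total += rr
    n = len(eval_pairs)
    return {"hit_rate": hits / n, "mrr": rr_total / n}


if __name__ == "__main__":
    # ASSUMPTION: canned rankings standing in for a real retriever's output.
    canned = {
        "q1": ["a", "b", "c"],  # truth at rank 1
        "q2": ["b", "a", "c"],  # truth at rank 2
        "q3": ["c", "b", "d"],  # truth missing
    }

    def fake_retrieve(question: str) -> List[str]:
        return canned[question]

    print(evaluate_retriever(fake_retrieve, [("q1", "a"), ("q2", "a"), ("q3", "a")]))
```

Good retrieval scores do not guarantee a good final answer, but they let you hold one component fixed while tuning the rest of the pipeline.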

💡 RAG is a hack that works well by stuffing information into the prompt, and fine-tuning is needed when combining RAG with a retrieval data structure mechanism.

Chapters (23)

0:00 Introductions and Jerry’s background

4:38 Starting LlamaIndex as a side project

5:27 Evolution from tree-index to current LlamaIndex and LlamaHub architecture

11:35 Deciding to leave Robust to start the LlamaIndex company and raising funding

21:37 Context window size and information capacity for LLMs

23:09 Minimum viable context and maximum context for RAG

24:27 Fine-tuning vs RAG - current limitations and future potential

25:29 RAG as a hack but good hack for now

28:09 RAG benefits - transparency and access control

29:40 Potential for fine-tuning to take over some RAG capabilities

32:05 Baking everything into an end-to-end trained LLM

35:39 Similarities between iterating on ML models and LLM apps

37:06 Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning

43:10 Evaluating and optimizing each component of the LlamaIndex system

49:13 Building retrieval benchmarks to evaluate RAG

50:38 SEC Insights - open source full stack LLM app using LlamaIndex

53:07 Enterprise platform to complement LlamaIndex open source

54:33 Community contributions for LlamaHub data loaders

57:21 LLM engine usage - majority OpenAI but options expanding

1:00:43 Vector store landscape

1:04:33 Exploring relationships and graphs within data

1:08:29 Additional complexity of evaluating agent loops

1:09:20 Lightning Round
