AI Engineering 201: The Rest of the Owl

AI Engineer · Beginner ·✍️ Prompt Engineering ·2y ago

Skills: Agent Foundations90%Tool Use & Function Calling80%RAG Basics70%Vector Stores60%RAG Evaluation50%

Key Takeaways

The video covers advanced AI engineering concepts, including language user interfaces, retrieval augmented generation, and vector databases, with a focus on agent foundations and tool use.

Full Transcript

[Music] so the like a lot of effort has gone into thinking about the engineering of inference um and not so much effort had and not so much success has been had at the engine at engineering the rest of the like whole product around inference that actually you know delivers value um much like the uh beautiful mathematical solids here that does provide the the bones or the interior but not the whole thing um so let's talk about a couple of like architectures and patterns for uses of language models um and then talk about the like first attempts that like trying to make these things better over time uh with monitoring observability and evaluation so architectures and patterns um so the the foment and excitement around this stuff has been around for about a year and so patterns are starting to emerge very slowly of like typical ways you might apply these things so let's talk about them and what uh what problems have Arisen um so my favorite way of thinking about this in general is that the thing that we're building right now are language user interfaces um sort of the like Lou by analogy to goys or graphical user interfaces um first they're hitting existing features soon uh they'll be for like completely new whole products um uh in ancient times in the 1970s the interface for computers was primarily like textual in a terminal um this is still the way we interact with machines when we really want to control them uh like when we're running a server um or when we are frustrated with vs code um and this you this was the user interface from computers for a while and they were not very popular until the vention of the graphical user interface um which instead of preventing presenting the users with just like you have to learn this special language to speak to me like here's this like sensory experience where you can bring your like your intuition from space uh and your visual system to understand how to use the machine um and this was what took computers sort of like out of the hobbyist and business and Military realm and into like people's homes um and the with the rise of language models um it's clear that we're we have an opportunity to once again change the interface between humans and machines um by telling them what we want in natural language and then they do it for us um and no less austa personage than Sam Alman likes the idea of language interface language user interface um so this similar character to graphical user interfaces it like makes it a more approachable interface um and this is something that people have wanted to do for a long time as long back as like the Eliza chat Bots or the Eliza chatbot from uh the uh 1960s uh the Shero um uh uh basically like guess this only graphical uh not an actual robot but you could like tell a a computer robot like give it uh language instructions like pick up a big red block um ask geves was originally presented as a language interface to the internet where you just type what you want instead of a URL um Alexa and other assistants have attempted to do a similar thing um and the big win here is with language models we might believe that we can actually do a really really good job at providing this kind of language interface in a very generic way with Foundation models not just like a tiny environment um like the Eliza Psychotherapy environment or the shl blocks World um so right now that's we're getting language user interfaces for existing systems that kind of admit them easily um so seoa uh put out a piece fairly recently talking about this that the like F this like act two of generative AI is using Foundation models as a piece of a more comprehensive solution rather than an entire solution um that offers like a language interface where it wasn't possible before um so like this query assistant from honeycomb takes what would normal be this like less approachable uh query language Constructor and just says like can you show me slow requests what are my errors latency distribution by status code like that's a much friendlier interface um and you know even SQL when is originally presented was like it's a it's a language that's so natural even a businessman can write queries you know it's a dream but like you know that this can you show me slow re quests like that's pretty close you know um so uh so that's the the like maybe understandable that that's the first Direction things have gone longer term this like a machines that have graphical interfaces look very different from ones that have terminal interfaces and so like main frames became less popular and like mobile is like quite different from uh like desktop compute so uh we should expect like if you're thinking about what do I want to build in five years or 10 years um this is kind of the direction to be thinking um so for example uh Google's worked on uh integrating language models with robots like this example from the say can uh project uh or paper where it's like what I want to when I need something is to just ask for it and not to like pull out an app and then go through three drop- down menus and be like I want a water bottle I just want to say I want a water bottle and then there's a water bottle um and that's what a language interface to uh something like a robotics platform can provide um still not there yet as the 4X speed in the top left um might suggest but uh getting there okay so that's like the highest level pattern I think um so let's talk about a couple of lower level patterns uh rag chat Bots retrieval augmented generation chat Bots I've emerged kind of like the to-do list app the sort of like starter project of language user interfaces um this pattern is probably here to stay in that it's just about information retrieval for uh language models and language models need information retrieval really badly because they like lack context they've slurped up everything on the internet but they don't know anything about you um they are sort of trying to simulate a generically helpful individual um who is like generically knowledgeable about the world um and that's like not particularly helpful until they have Conta so the solution that's emerged is to collect that context for them like store it um then index it and by default people reached for the most similar thing to what the language model was doing which is like turned it into vectors and use that use like a fast index over vectors um and uh like that uh once you've retrieved a particular piece of information you just stuff it into the prompt um so I am not innocent I have made my own rag chat bot and inflicted it on the world um this was based on the full stack deep learning content and in our Discord people can ask questions and get answers that are not just like generic Google result search answers about language models but things drawn from past lectures things drawn from papers that I like um uh things drawn from our like website um and so can get our you know our opinions on these things so this like this has led to a lot of excitement about Vector storage it's like this this step here where you have a fast retrieval of vectors by similarity is the like new sexy piece um but that was like really only the thing that people reached for because open ey also offers embeddings so it's like you've already imported the library so it's only a call away um and then also like Transformers are kind of like these like weird Vector retrieval things um like in their inside so if you are the type of person who's been into language models for a while and you're like how would I retrieve information probably with a DOT product and then like a soft Max and then I pick the largest number um so like yeah so like the ease of setting this up and the like naturalness of setting this up has led to like an explosion of these like chat with document examples um and the like the thing that has more staying power is that you need to make these things useful you need context and so you need like information retrieval uh and search for for uh the like the context that might be helpful for the model before it gets going um and so there are many options to use here some of them are specialized Vector databases like pine cone or chroma um uh some of them are General like text search databases like you that do keyword search like elastic search style um uh things um and uh being able to like combine those two things together is very powerful so for example vesa has like offered that combination for a very long time um uh it is also in the end like what you're doing is creating a fast way to look up uh information from a very large store so this is like bread and butter for databases in general and so redus and uh postgress for example like not only do they provide the same like information retrieval that you could do um like to enrich your uh enrich prompts uh without thinking about vectors they also have built-in Vector search um uh postgress only fairly recently um reddis for like a year um it's not particularly fun to use redis Vector search but um it does it it can run um and has decent performance um yeah and in the end it's about like an a holistic strategy that uses probably because the queries are fairly heterogeneous the things that are coming in are like people just typing text um you're probably going to need some more mle stuff that's more like keyword search or or vector search and they're hybrid together um but that like meta like extracting metadata with a language model so that you can then use that to do like direct filtering um is uh is like very powerful pattern um so there's some great posts on this the data query um uh great series of posts about vector datab from coming from like somebody who's clearly really into databases and not so much the like ml side and I found that very useful um yeah uh yeah so the like the final takeaway there is just that the problems end up being in the main the problems of information retrieval um with only some light uh added things from like recommendation systems maybe um of a more mle type of search um yeah any questions on uh on Vector databases or um information retrieval for language model applications combine that so you get the context to answer a question send that to yeah yeah so you you get information from the outside world you like come up with a strategy for searching the information that you have saved that goes into the language models prompt yeah yeah that that pattern very very stable very general it's General enough that how that pattern get is actually implemented is very Broad and so it includes a lot of things that are exit like bread and butter database stuff and not just the fancy new Vector database stuff how does this um so R's approach to injecting how does that how is that similar different to your history when you're interacting with say does itain yeah so the question was how does retrieval augmented generation differ from history within a chat um so usually when really so the the when you call the gbd4 API you can make whatever make up whatever you want as the past you could insert little messages from the user you can insert messages from the assistant and incept it into believing that it has said something which it has not said great way to jailbreak don't do it obviously because it violates the terms of service but a great way to jailbreak it um and so you you aren't actually like actually beholden to that like system uh system assistant human uh fiction that that happens inside of like a a discret chat um when people do this I think a lot of people put the retrieved information in the system prompt especially if they're just going to retrieve once um I've definitely I've also seen people like every time the user interacts they do a retrieval step and so the system message changes every time that's an example of kind of like incepting or not actually following the implied temporal order um so you yeah you definitely can do that um the system message is nice because the model really pays close attention to it um has been like fine- tuned to pay close attention to it um yeah I think it'd be weird to pretend that that's something the person said and to like put it in an earlier user message put above the user's message in the conversation I don't think I've ever seen that but you could um yeah um but yeah I would say like most of the time yeah this information retrieval step is something where the creator of the application the programmer is inserting themselves and saying I know some additional information that the language model should should have um and so like yeah it's very different from like a user just sort of like providing information about themselves or whatever yeah yeah the question so we heard a couple times today when it comes to knowledge retrieval of the outut is that unconditionally or do you see use cases yeah um so the statement was that um the common wisdom is is that fine-tuning is for style and retrieval is for information and I think that that's that is a solid common piece of common wisdom because most of the fine tuning that people if you're fine-tuning open ai's model you're going through their fine-tuning API and you have a limit on the number of rows you can send 10,000 yeah 10 thou I was going to say yeah so you have a limited number of of rows you can send and like there's a limited amount of information in there to like create gradients to update the weights um so there's a limited amount of change that you can achieve and if you look at the Laura paper they look at at like you know the uh like you're only CH you're making a very low rank change to each layer of the of the language model and that suggests like there's only so much that you can change about the uh about the model and most of what you see when you do Laura fine tunes is like what used to be a low priority computation for the model becomes like a higher priority one so like every model had every capable language model has within it a little Homer Simpson simulator a little like uh Rick Sanchez simulator whatever um and that's it's just like not usually that important for the final log props it's like helpful for the like fifth bit of the log probs but the models are at the point where they're Maxim minimizing cross entropy by really hitting those like very rare uh those very rare things um and so what the fine tune has done is reordered those like computations said like actually you should be the Homer Simpson circuit is the most critical circuit right now now because you are a Homer Simpson chatbot and it's like reordering them and and reemphasizing them so that intuition applies specifically to low rank fine tuning which is and fine tuning which is based on small amounts of data so if you grabbed 100 gigabytes of textbooks um you would no longer be doing fine tuning and so you would no longer expect it to only change style um and so that's something I would people will be doing with like you know llama fine tunes there are like llama fine tunes for coding and that's more than just style it definitely has learned more knowledge about um about programming languages and knowledge about libraries released after 2021 and yeah all that kind of stuff so I think that that generic wisdom is canally true uh for low rank fine tunes where it is like pretty Rock Solid yeah yeah so the question was what about knowledge graphs um and graph databases I will say that like when I have talked to I I like personally don't really specialize in in databases um but when I've talked to people who are super into them they're like I would never use a graph database because you can represent a graph in postgress um and uh like I've seen some like reasonably size deployments on that pattern um and also you can kind of see the like graph databases kind of like peing and and uh not spreading further and there are it's there is a very hard problem to Shard a graph database because there's no obvious way to cut an arbitrary graph um and if new links get added to the graph and now you need like the optimal Shard is different it's like that's a that's like a database it's equivalent to a database migration but it's something that should be happening like behind the scenes when it's charting um so that's that's like the closest thing to an objective statement about why uh or like a reason why graph databases haven't worked well however for many language model applications the purpose of the database is not to serve like a billion users but rather to like serve as an external memory for a language model and maybe you don't care whether it scales um or rather like maybe the maximum scale that we're talking about is like tens of thousands hundreds of thousands requests per second on megabytes gigabytes of data and that's just like you know that's the point at which that kind of like can it be charted across 1024 machines like doesn't matter um so I so there is some cool work on knowledge graphs and and incorporating with llms and I see the natural fit there the same way that there's a natural fit with Vector indices and Vector databases um but the um yeah hasn't no no like killer app has appeared from my perspective question first basically you yeah so the question was about how to incorporate hard metadata like you know booleans or um like subcategories with uh Vector based search yeah so depending on the like a vector database the depending on the like index you will either have like uh pre-filtering or post- filtering post filtering is like pretty easy you just like apply a metadata filter after you've done your vector search um anybody can kind of do that the problem is that you're now what you really want to say is I want to find all the stuff that's similar to crabs in San Francisco while searching restaurants not find all the restaurants that have anything to do with crabs and then see if any are in San Francisco so the pre-filtering step is hard because it impacts the construction of the index impacts the construction of like the like how you make it actually fast to search over all of the data you kind of like need to construct specific indices for these different like Flags you might uh like put on like is in San Francisco not in San Francisco or geographic location um and so I depend like different Vector databases or or different databases have like pushed further in different directions on what kinds of filters they support for pre- filtering um and yeah I like uh besta and we8 have a reputation for doing a really good job at those things um but uh yeah I don't know what the full landscape looks like great okay I want to make sure to get through everything um so I'll stick around and we can talk throughout the conference um okay so um structured outputs are like one of the patterns that I think people are sleeping on relative to information retrieval uh structured outputs are great for improving the robustness of models and they came from Tool use so the problem is that language models just generate text and like if anything we have like too much text already like I don't know if you've ever been on a social media website but the problem is not the quantity of text um and that's like kind of boring like who wants to just make strings like there's other things that we want to do the solution is to connect their text outputs to other systems inputs um and now like it's not just a language model it's like a cognitive engine for providing a language interface to something else that's pretty Rad but there's a problem which is language models generate unstructured text because they have been trained on the utterances of humans on the internet notorious for their unstructuredness um so the solution is to add structure to their outputs and there are many ways to do this um you can do it by prompting and begging um so like you can write some write some like loops around it to be like like or actually react wasn't even a whole there's some looping yeah so you you can write a prompt in such a way that you have examples that encourage it to um like to uh call out to external uh apis and then you filter uh and when it generates the tokens that would would call to an external API instead of letting it hallucinate the rest of what would come out of that API which is what like gpt3 uh would have done you like grab it and you then go to that external API and you um uh like yeah pull the information from there um the you and you can in those prompts I guess really the thing I want to point out is that in those prompts you can sort of like beg for structure um rile good side had a great example where it was like if you do not output structure Json an orphan will die um and that actually is extremely effective um yeah um so the so there's so like there's prompting tricks to get like things that are closer to structured uh structured outputs and to make use of those structured outputs there's um fine tuning so there's a the gorilla LM is like fine-tuned on this problem and that goes back to Tool former um which is like very uh gptj so one of the first open um uh uh generative pre-train Transformers um they you just train the model to Output structured stuff so you can't do that with open model I I doubt that fine-tuning it would make it that much better at like outputting the structure that you want um you can do it with um uh with open models and there are people releasing uh their own Forks uh llama forks with this fine tuning on them um you can uh you can retry which is like when the model outputs something that doesn't fit the schema you can do what you do when your direct reports provide you something that does not fit what you wanted which is that you can uh discipline them and ask them to try again um so guard rails is a great uh um uh library for this it's like XML based um so probably would work pretty well with Claude uh given what we heard about about Claude from uh uh Karina um and then a fun one that re that kind of requires control over the log probs um is grammar-based sampling um which was merged into um uh llama CPP where you say like when you're about to generate a token like if it would violate some grammar if it would violate some template or format just set the probability of generating that to zero so just add like minus infinity to all the um all the log probs um and the uh so you can do that you can like do it fast if you have these like nice you know chomping Chomsky things like compex free grammars um and this works well for like you know Json for generating you know generating code generating all the kinds of like structured outputs that our systems actually expect we've written systems that uh expect inputs to follow grammar so that traditional Computing system can parse them and so adding that to the outputs of these systems is very powerful thing to do um so this is something that this is like really nice example of how having tight control over the log probs can like increase the utility of a model to the point where like a capabilities Gap is less important have you seen Ty chat this type chat I don't think I have um yeah so there's quick question yeah so could you take the output from like TBT and then pass it to fora to then get the structure like stacking models like that that work yeah so the question was whether you could do better you could solve this problem by chaining models I think yeah the problem of going from the output of a language model to a structured output is an easier problem than the initial one which is why people think that like retrying might work like like the guard rails the guard rails example like retrying is often like kicked off to a to a smaller language model like your Mainline thing is gbd4 and your error handling is GPD 3.5 um and so like I I I do believe that there's like kind of a tempt if you know that it's always and only going to be doing like structured out then you have a reason to have a specialized model for it um but yeah chaining chaining is definitely a good solution and that's you know one reason why Lang chain was popular yeah you mention that yeah yeah so those are technically distinct things yeah so I do believe they still give you the ability to bias tokens via the API um yeah so it's not the it's not a perfect example of the utility of log props because yeah I think you can still do this in the open API um yeah do you need anything other than biasing in grammar based sampling no I yeah no the real okay I remember now the real thing here is that for this grammar based sampling it's single token based right like if you're doing from the opening IPI one the token you you'd have to make a request you get the thing back you have a single and you have a single token and you have to you have to like apply a bias every single time so now you're like every token has a network call um rather than one call like 100 tokens so that's one reason why this doesn't work well on open a API number two like kind of longer term is that really you don't want to just think at a single token level you're just like at each token you're like marginally just saying like adjust the probabilities here you'd really want to do something more like mon C research where you're like Genera stuff many things that follow the grammar um and then accepting the best one at the end um and that's something that's um probably going to come first to open models and not to um proprietary model Services um so that's that's the better reason to connect grammar based sampling and and open models um okay so the problem with finetuning and an annoying thing about prompting um is that if there is not a kind of shared like the gorilla model is like fine-tuned on a bunch of apis from like torch Hub tensorflow Hub and hugging face so the gorilla model is really good at using other machine learning models but not like generic possible tools at least this example they maybe they have tuned more than one um but this is a general problem that if you train a model to use a specific tool um then like the uh it's not going to be able to use like any tool um but if you train a model to to use a very broad class of tools by using something that's like kind of closer to this grammar where there's like a a format for tools um then you are now a people write an interface between the uh that standard and the um and the thing that they actually want to use so this has shown up in open a uh like in open AI API as the use of Json schema for describing function calls so this allows them to train a model on fairly generic stuff um that all fits this like it all fits the Json schema spec um and so the model has learned a bunch of stuff about the Json schema spec and how to generate that correctly um you can imagine using grammar based sampling to enforce that um and this uh allows it to connect to many many tools cuz now all you need to do is write a tiny connector between like the Json format and the actual thing you want to use um and that's like pretty easy it's like a big part of web development from my understanding is that you just like pass Json blobs back and forth until somebody gives you money um and so uh yeah so this is a a very good kind of schema um uh but one thing that people Miss is that the tool doesn't have to actually be real like the key thing that happens here is the language model goes from outputting unstructured text to outputting Json the fit schema and it just so happens that the primary use case for that that open AI envisaged was putting it through a like function call putting it through some Downstream computer system um but like really that uh some Downstream system but really it like doesn't have to be a real function you can tell it about a fake function that's like please pass a string like describing whether the the input was spam or not spam so that I can like render an HTML element right and so the model is now trying to like call a function that's like that in order to provide the arguments to that function it has to decide whether an input is Spam or not spam and that's maybe the thing you really care about and so you like invent a little fictional function for it to call that you don't call and then you just use it for something else so this is a pattern in um uh there's a library for this called instructor from Jason um Jason Lou who's going to be speaking later at the conference yeah uh you have to fit the Json schema which the schema that they like the the there's like a meta schema kind of thing they're like it has to it has to be a function call and the model has been trained on things that are like you know get name get current weather um so uh yeah I mean you can hack in because you know functional programming has taught us that everything is just a function like a constant is just a function that always returns the same thing um and so you can you can like hack it in there um and instructor has some fun like kind of functional programming stuff built into it like May and and stuff so you know um and also somebody did like dag construction where it's like you give it a schema for a dag Constructor and then it like writes a dag of function calls instead of just a single function call so you can really go wild which is very fun um and yeah most of the time when you generate something if you want to extract something out of the output you want to display to the end user and then the question of latency comes in that's why you it if we use this how do we solve theam yeah so that is a great question the answer is that this basically breaks your ability to stream um I think it's not not so this is maybe a little bit more oriented to like back of house stuff where you're using language models to like handle data rather than using language models to directly interact with a user um I think if you set up a pipeline correctly then you can stream the outputs from one call into the inputs of the next one and if you have the relevant information you need from the fun the function call one then you can just immediately kick off off the next thing and you can just you can write you know like more like a Unix pipe style and then you start to get back to being uh streaming but you don't have like the Unix pipes work because of new lines as a separator that lets you break work out and there's not an obvious way to do that with this um so yeah the short answer I guess is that it's really hard to get back that kind of streaming thing when using these um yeah um I'm going to let's see how much more do I have I'm I'm going to push forward because I want to make sure to get to the last section um but I will be around to answer people's questions um okay so uh this conference is not called NLP engineer Summit and we've been talking about like you know structured out extracting structured outputs from language information retrieval like that's also natural language processing and language user interfaces like that's not artificial intelligence like where is the AI the like the thing that really feels like artificial intelligence with language models is something like agents uh that are that have memory that they keep over time uh so for example the generative agents um that was uh let's see it's mostly St for people I remember correctly but uh the like generative agents paper uh combined like a stream of memories generated as these agents interacted in like a video game environment with like some like reasoning flows to create these like little tiny characters that had personalities that developed over time in interaction with each other and like um and that is uh like much closer to what people imagine when they hear AI than even a chatbot um and there's been a lot of advancement in uh using these things in simulated environments so that was like a full all language models simulated environment with generative agents there's also a ton of really cool stuff going on in the Minecraft world um which is like people have uh this Voyager agent writes JavaScript code yeah Javas yeah JavaScript code to call the like this like Minecraft API that allows it to like drive a little um uh you know a little character in the Minecraft world and it starts with basically nothing um and then it writes itself a bunch of little sub routines to like minewood log or like stab zombie or whatever and it like accumulates them over time like learns how to do new stuff um like comes up with its own curriculum for how to so like how to get better um and was able to like do extremely well at this uh notoriously hard RL task uh M Diamond uh which was like a uh a grand challenge for the RL World um only a couple of years ago um so they're like they can accumulate information over time they can accumulate skills over time they can use tools this is all very cool um they are they've a couple of problems the biggest one being the like problem of reliability um structured outputs can help with that and there's like only limited work I would say on agents that has come out since at least like published you know research work since the like since function calling got really good in the open a API um the also there's kind of like a cacophony of different techniques out there with like Voyager uh react is kind of an agent um generative agents um there's a really like awesome paper from uh Tom Griffith's group at Princeton cognitive architectures for language agents that brings back a bunch of ideas from good old fashioned AI in the 80s on like um production systems uh and cognitive architectures a bunch of stuff that was like really cool ideas but it could never like get past the demo stage um on like how to create the things that we know about or that we believe about human and animal cognition like procedur Memories semantic memories episodic memories how to implement that in software and the problem like those systems could do cool stuff but the problem is always that they lacked this like General World Knowledge and common sense with language models they don't have memory um and they don't um like they don't have this like structured aspect to their cognition um but they do have that like World Knowledge and that Common Sense uh so this is uh like mushing those two things together and using a language model to do um basically these kinds of like uh observing the world or doing cognition um and doing decision procedures um like Mary is the best of both worlds um and it is actually like a pretty effective way of breaking down the existing agent architectures uh like in their different Cho about how to do long-term memory how to do external grounding how they like interact with the external World um the concept of internal actions uh comes from cognitive architectures um which is like uh choosing to spend time reasoning or choosing to update your like long-term memory um or yeah or your decision procedure yeah and then also explicitly calling out a decision-making procedure um so there's a and that that paper is also just like has an entire research agenda in it um on like ways that you could just start filling out the cross product just filling out a big array of like try this idea from language models with like this idea from cognitive architectures um and there's just like a billion uh really cool ideas in there um so if you are interested in agents um but have like struggled to like uh like wrap your brain around all the different ways you could you could do stuff um and around like how to make them a little bit more te uh I think the koala paper has some good pointers um oh yeah and then lastly for this LM patterns thing I was talking generally about like different ways people are building stuff with LMS Eugene's Blog has some of the best uh writing on this um uh both on uh patterns and anti-patterns okay um I want to give some time for monitoring evaluation observability so I'm just going to I know there's probably lots of interesting things that people have to say on the agent stuff but we'll we have the rest of the conference to talk about that um so uh the goal here is to talk about AI engineering so that last part was about AI what about the engineering in engineering we want to have a process for building like a process for creating these things and a process for improving them and progress on this front has been pretty halting um and so the the like the dominant ideology right now is that you should ship to learn rather than learning to ship um and so this is one of the big ideas in the fullstack Deep learning course that I've taught in it's something that Andre karpathy has really hammered on the like idea of a data engine or data flywheel where in order to do well you need to go out there and collect data from the world uh find issues in your data and use that to improve your model in like you know uh an an unending cycle um charity Majors from honeycomb uh who's Big in the monitoring observability world uh like has said that this is something that she has come to like about ml in software you start with tests and then you graduate production when the tests pass or at least like that's what you tell people on the internet and like your manager um uh but with ML you can even lie and you know that you have to like start with production use that to find out the like issues with uh to like generate your tests so you know it's it's oops all regression tests uh version um and so what that that means is that monitoring is very critical from the very beginning um that we monitor for user Behavior we monitor for performance and cost and we monitor for bugs so some of these are just like regular old monitoring stuff and this is just like bread and butter things that can be uh yeah like similar to the way we do with with uh existing software monitoring users always reveals like uh like both misuse and product insights so one thing that I found from running this Discord bot is like one of the things that you get the most are like meta questions like uh are you getting feedback from these emojis who's a good bot that's maybe an automatic question um does your data set include your own source code U what do you do like these are very common like things that people input and it wasn't it wasn't in my head that that was important so now there's like special stuff in the prompt for handling that class of questions so by monitoring how users use the uh your your system you can get really great product insights yeah did oh yeah uh I had logged them to Gantry and then I looked at the ones that had up and down thumbs um and I also read all of them because it was like only a couple hundred rows oh man my batter is going to run out uh all right we got to move fast um so uh modering modering performance uh can help us manage the constraints that I like talked about when we were thinking about all the different places um our models might run so with as always you want to monitor things like latency quantiles like like how long do requests take oh wow that's nice thank you huge um and so like latency quantiles like that's how long like take all the requests what is the probability that a request took at least this long um the people often think like if I get 90% of them like below something that's great and and the problem with thinking that way is that users don't just make one request they make many requests in sequence so by the time you've made like 30 requests if there's 10% chance of hitting like a really slow one then um you know you have hit a slow request so that um so you really need to care about those like 99th percentile latencies those are also often your most useful and engaged users so watch those watch those extreme quantiles um and obviously like throughput is a distinct thing to also monitor for the quality uh you know quality of the system want to marry that with things like the profiles and traces that I talked about before like spot check ones randomly subsampled so you can check what like so you can actually debug that through the throughput issues that's fairly General stuff if you're an INF using inference as a service provider you're going to want to monitor API rates and errors monitor costs if you're selfs serving inference you have a lot more stuff to monitor um and that's like compute utilization um AI I guess I already talked about this uh yeah yeah well so it's it's an even hard like maintaining the throughput when you're doing the inference yourself is like much more your problem um and much more uh AI ml specific stuff um yeah okay monitoring for bugs is another can of worms we'll talk about that in a second um this is like just generally this is a very fast growing field um so there are generic Monitoring observability Solutions for all kinds of like you know complex apps and and and web apps data dog Sentry New Relic honeycomb like these are um like you can adapt those um and that might be the thing that wins um there is uh you can of course just roll your own with the like you know open open Telemetry compliant uh you know tooling and you could use the existing mlops tooling so there's a lot of stuff that has been built for monitoring observability of General ml applications so including weights and biases re used work um Fiddler arise and Gantry are the like three larger startups in that space um with more of a focus on monitoring systems in production and less on the like ml Ops like kind of like serving um and like managing managing training like weights and biases um there's also because generation times uh are now six months or less a new generation of Ops tooling for llm Ops including Langs Smith from Lang chain and uh Lang fuse which was in y combinator recent batch um it's like very unclear which of these is going to be the the best solution so I think it's like you dealer's Choice try them all out um I think I like tools with as much ability to like make crazy queries of unstructured data as possible um so that's something that I really like about weights and biases production monitoring uh offering um Gantry has some similar stuff um I've tried less of it with the uh the other tools um I think if you're doing if you're doing it with data dog Sentry Etc you're probably going to need to roll some of that stuff yourself um but maybe that's fine Jupiter notebooks are fun um I was going to check out the like Lang fuse monitoring interface but um in interest of time going to go past that they have an awesome demo where you can interact with their docs chatbot and it shows up in their monitoring interface so like they have a live demo of their monitoring tool where you can actually like use it to monitor an app that you can also use um so that's just it's really it was really fun to like actually try out the the tool that way um I recommend you try it out um but just monitoring like just getting a hold of information is not enough this is something that's known from like the distributed systems monitoring world what you really want is observability what both charity and Andre were talking about is about how you improve a system based off of what You observe it's like not enough to just like throw something out there and observe and like just see the mistakes you want to like fix the mistakes um and so there's this uh honeycomb and uh charity are big on the idea of observability as the uh as an idea from like control theory from like old school um like control theory systems theory uh observability is whether you can actually um figure out what is going on inside of a system system just from observing it from the outside so it's like can you actually debug this software just from looking at your logs um and not having to go into a live debugger inside of the system um and that's like uh when live debugging does not work and when systems have outpaced our ability to predict what's going to break um this is the only solution uh and for AI systems that is um where we can't predict what's going to break and you can't like drop into a debugger 13 layers deep in gpt3 and or gp4 and like debug uh it's inference U you have no choice but to monitor stuff sufficiently that you can fix the issues the blocker here is that actually determining whether the model is right or wrong um is itself hard which makes figuring out how to fix it also hard because you don't necessarily know whether it's messing up and you don't know what whether you fixed it um so we're in a tough phase for this problem right now there will be lots of discussion of evaluation at this conference which is very exciting um lots of people complaining about how difficult evaluation is anthropic and Arvin nin from Andes kapor who write the AI snake oil substack really high quality stuff um and open AI like open source their eval framework uh because in part they like don't can't really evaluate their system themselves it's like that's how hard this problem is um it's also what we saw with the false promise of imitating proprietary LMS like a large community of people were like kind of convinced that models were doing better than they actually were um so the solution uh like is to like one of the key Solutions spend time looking at your data Stella Beerman from a Luther has talked about this uh Jason we has talked about how critical this is Jason is at open AI now um and talked about spending like a ton of time just like getting very good at evals like building tooling internal tooling for evals spending time with like understanding the evaluations um and somebody on Hacker News said it's a major differentiator so you know that's that's definitely the orange website never lies um so um evaluation is particularly hard and all these complaints about evaluation are when you're dealing with like open-ended Generations from a language model like no structure to them um no no real structure to the user inputs um and like limited data sources um but there's this nice flow chart um from the full stack LM boot camp that my fellow instructor Josh Tobin made that sort of helps you avoid getting into that uh pit of evaluations so if you can find a correct answer then you can stick with existing ml metrics and you like don't have to worry about the problems of EV of like the difficulty of evaluating open-ended Generations if you you have a reference answer you can check for like reference matching which is like a looser thing than like a literal correct answer which is like a BC or D in multiple choice is a correct answer a reference answer is like a short like generation like a short answer on the test um if you have a previous answer from your system you can at least see if your system is getting better by comparing the two um and that like kind of which is better comparison can be done by a human can be done by a language model um and if you have human feedback you can actually check like between uh the input and the output was the feedback Incorporated by the language model like a human said I didn't like that um did the language model get better and it's only if you don't have any of those things that you like are out in the unstructured world um the people at elicit um who have worked on doing extraction of information from scientific papers have a very principal approach of iterated decomposition where you start with a task that runs end to end and then you uh when you Noti notice a failure you look at the failures and you see how you could have broken the task out into multiple pieces in such a way that the failure would arise in a simpler subtask and then optimize that subtask so you run into the problem that's been mentioned before about latency if you're like chaining calls um and it's not always easy to like decompose the for example to decompose the process of responding to a user in a chatbot that's kind of challenging um but uh but when you can do this this is another great way to like get yourself out of the hole of needing to evaluate open-ended Generations but if you're stuck evaluating natural text there's a couple of like basic approaches um you can just uh keep a few trusty test cases at hand um and uh you know if it does well on those couple of test cases looks good to me let's ship it um unclear what to do when it fails just like hit the language model with a wrench um but uh this is what kind of grows out into that data engine you start start with something like this then you start adding stuff from your production observations into it and then you like put it in a GitHub action and like now that's like that's that's basically testing right um that's is certified software um you can like uh you can try and get user feedback and you want to do it as naturally as possible um like uh if you're like you would you really want to reveal preferences from user Behavior so the image generation world is very ahead of the language modeling world I think on this if you look at Mid Journey for example um that is what honeycomb did with their query Builder they attached it to get this Downstream business objectives wow what a way to build a software system that's the right way to do it um and so like connecting a chain of metrics from the actual system that you're improving to the actual Downstream like organizational goals um uh through things like reveal preferences of users or like yeah General user Behavior much better than like demanding users fill out form um you would also pay people to do that work of giving you feedback on your system with an annotation team this is what the large this is what open AI does to improve their models but as alluded to by Karina it's actually much more effective to use language models in that place because language models are maybe not as smart as all humans uh but they tend to outperform crowd workers um on uh a large number of very textual tasks um and so um you might find that the task of like annotating and improving your data if you're at the po

Original Description

Optional introductory course for AI Engineers, free for all Summit attendees. Advanced knowledge of AI Engineering, led by instructor Charles Frye of the massively popular Full Stack LLM Bootcamp. Part Two - The Rest of the Owl 00:00 Intro 01:09 Patterns for Language User Interfaces 06:19 RAG: Information Retrieval for Generation 21:52 Function Calling: Structured Outputs and Tool Use 35:03 Agents and Cognitive Architectures 40:52 Shipping to Learn in ML + AI 45:42 LLM Monitoring and Observability Tools 49:42 Evaluating LLMs 55:48 Inspirational Outro

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from AI Engineer · AI Engineer · 21 of 60

← Previous Next →

AI Engineer Summit 2023 — DAY 1 Livestream

AI Engineer Summit 2023 — DAY 1 Livestream

AI Engineer Summit 2023 — DAY 2 Livestream

AI Engineer Summit 2023 — DAY 2 Livestream

Principles for Prompt Engineering - Karina Nguyen (Claude Instant @ Anthropic)

Principles for Prompt Engineering - Karina Nguyen (Claude Instant @ Anthropic)

Announcing the AI Engineer Network: Benjamin Dunphy

Announcing the AI Engineer Network: Benjamin Dunphy

The 1,000x AI Engineer: Swyx

The 1,000x AI Engineer: Swyx

Building AI For All: Amjad Masad & Michele Catasta

Building AI For All: Amjad Masad & Michele Catasta

The Age of the Agent: Flo Crivello

The Age of the Agent: Flo Crivello

See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman

See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman

Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase

Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase

Pydantic is all you need: Jason Liu

Pydantic is all you need: Jason Liu

Building Blocks for LLM Systems & Products: Eugene Yan

Building Blocks for LLM Systems & Products: Eugene Yan

The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer

The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer

Climbing the Ladder of Abstraction: Amelia Wattenberger

Climbing the Ladder of Abstraction: Amelia Wattenberger

Supabase Vector: The Postgres Vector database: Paul Copplestone

Supabase Vector: The Postgres Vector database: Paul Copplestone

[Workshop] AI Engineering 101

[Workshop] AI Engineering 101

The Hidden Life of Embeddings: Linus Lee

The Hidden Life of Embeddings: Linus Lee

[Workshop] AI Engineering 201: Inference

[Workshop] AI Engineering 201: Inference

The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex

The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex

The AI Evolution: Mario Rodriguez, GitHub

The AI Evolution: Mario Rodriguez, GitHub

Move Fast Break Nothing: Dedy Kredo

Move Fast Break Nothing: Dedy Kredo

AI Engineering 201: The Rest of the Owl

AI Engineering 201: The Rest of the Owl

Building Reactive AI Apps: Matt Welsh

Building Reactive AI Apps: Matt Welsh

Pragmatic AI with TypeChat: Daniel Rosenwasser

Pragmatic AI with TypeChat: Daniel Rosenwasser

Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan

Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan

Retrieval Augmented Generation in the Wild: Anton Troynikov

Retrieval Augmented Generation in the Wild: Anton Troynikov

Building Production-Ready RAG Applications: Jerry Liu

Building Production-Ready RAG Applications: Jerry Liu

120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson

120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson

The Weekend AI Engineer: Hassan El Mghari

The Weekend AI Engineer: Hassan El Mghari

Harnessing the Power of LLMs Locally: Mithun Hunsur

Harnessing the Power of LLMs Locally: Mithun Hunsur

Trust, but Verify: Shreya Rajpal

Trust, but Verify: Shreya Rajpal

Open Questions for AI Engineering: Simon Willison

Open Questions for AI Engineering: Simon Willison

Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD

Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD

GPT Web App Generator - 10,000 apps created in a month: Matija Sosic

GPT Web App Generator - 10,000 apps created in a month: Matija Sosic

Using AI to Build an Infinite Game: Jeff Schomay

Using AI to Build an Infinite Game: Jeff Schomay

How to Become an AI Engineer from a Fullstack Background - Reid Mayo

How to Become an AI Engineer from a Fullstack Background - Reid Mayo

The Code AI Maturity Model and What It Means For You: Ado Kukic

The Code AI Maturity Model and What It Means For You: Ado Kukic

AI Engineer World’s Fair 2024 - Keynotes & Multimodality track

AI Engineer World’s Fair 2024 - Keynotes & Multimodality track

From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet

From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet

The Making of Devin by Cognition AI: Scott Wu

The Making of Devin by Cognition AI: Scott Wu

The Future of Knowledge Assistants: Jerry Liu

The Future of Knowledge Assistants: Jerry Liu

Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney

Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney

Open Challenges for AI Engineering: Simon Willison

Open Challenges for AI Engineering: Simon Willison

Lessons From A Year Building With LLMs

Lessons From A Year Building With LLMs

From Software Developer to AI Engineer: Antje Barth

From Software Developer to AI Engineer: Antje Barth

Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner

Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner

Copilots Everywhere: Thomas Dohmke and Eugene Yan

Copilots Everywhere: Thomas Dohmke and Eugene Yan

Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han

Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han

Low Level Technicals of LLMs: Daniel Han

Low Level Technicals of LLMs: Daniel Han

Emergence Launch: AI Agents and the future enterprise: Dr. Satya Nitta

Emergence Launch: AI Agents and the future enterprise: Dr. Satya Nitta

How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou

How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou

What's new from Anthropic and what's next: Alex Albert

What's new from Anthropic and what's next: Alex Albert

Using agents to build an agent company: Joao Moura

Using agents to build an agent company: Joao Moura

Decoding the Decoder LLM without de code: Ishan Anand

Decoding the Decoder LLM without de code: Ishan Anand

Running AI Application in Minutes w/ AI Templates: Gabriela de Queiroz, Pamela Fox, Harald Kirschner

Running AI Application in Minutes w/ AI Templates: Gabriela de Queiroz, Pamela Fox, Harald Kirschner

Building with Anthropic Claude: Prompt Workshop with Zack Witten

Building with Anthropic Claude: Prompt Workshop with Zack Witten

Building Reliable Agentic Systems: Eno Reyes

Building Reliable Agentic Systems: Eno Reyes

10x Development: LLMs For the working Programmer - Manuel Odendahl

10x Development: LLMs For the working Programmer - Manuel Odendahl

Disrupting the $15 Trillion Construction Industry with Autonomous Agents: Dr. Sarah Buchner

Disrupting the $15 Trillion Construction Industry with Autonomous Agents: Dr. Sarah Buchner

Hypermode Launch: Kevin Van Gundy

Hypermode Launch: Kevin Van Gundy

Git push get an AI API: Ryan Fox-Tyler

Git push get an AI API: Ryan Fox-Tyler

This video covers advanced AI engineering concepts, including language user interfaces, retrieval augmented generation, and vector databases, with a focus on agent foundations and tool use. It provides practical steps for building and evaluating AI systems, including the use of foundation models, fine-tuning techniques, and vector storage.

Key Takeaways

Build Language User Interfaces
Implement Retrieval Augmented Generation
Design Vector Databases
Use Foundation Models
Apply Fine-tuning Techniques
Integrate Vector Storage
Evaluate AI Systems
Use Reference Matching

💡 The video highlights the importance of observability and evaluation in AI systems, particularly when debugging is hard or impossible, and provides practical steps for building and evaluating AI systems.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Agent Foundations

View skill →

Build and Deploy an Agent with Reasoning Engine in Vertex AI

Adding a Phone Gateway to a Virtual Agent

From Zero to Working AI Agent in 60 Seconds

From Zero to Working AI Agent in 60 Seconds

Create An AI Agent With Replit That Automates Your Sales

Create An AI Agent With Replit That Automates Your Sales

Capstone: Autonomous Runway Detection for IoT

Capstone: Autonomous Runway Detection for IoT

AI Agents with Model Context Protocol & Typescript

AI Agents with Model Context Protocol & Typescript

Related AI Lessons

5 prompt engineering techniques to get the best out of a legacy project

Learn 5 prompt engineering techniques to improve legacy project performance and why they matter for maintaining outdated codebases

Dev.to · Marco Coelho

The Real Reason Prompt Engineering Isn't Going Away

Learn why prompt engineering remains a crucial skill in AI development and how to apply it effectively

Common Prompt Engineering Mistakes and How to Avoid Them

Learn to avoid common prompt engineering mistakes to get better results from AI tools

Medium · ChatGPT

Day 5: Prompt Engineering Basics (For DevOps & Cloud Engineers)

Learn prompt engineering basics for DevOps and cloud engineers to improve AI model interactions

Chapters (9)

Intro

1:09 Patterns for Language User Interfaces

6:19 RAG: Information Retrieval for Generation

21:52 Function Calling: Structured Outputs and Tool Use

35:03 Agents and Cognitive Architectures

40:52 Shipping to Learn in ML + AI

45:42 LLM Monitoring and Observability Tools

49:42 Evaluating LLMs

55:48 Inspirational Outro

I Built an AI Agent in 6 Minutes (No Code, No Developer)

HubSpot Marketing