Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

Latent Space · Beginner · 🚀 Entrepreneurship & Startups · 2y ago

Skills: Agent Foundations (90%), Tool Use & Function Calling (80%), Multi-Agent Systems (70%)

Key Takeaways

Kanjun Qiu discusses the limitations of today's AI agents and the need for debugging tools, with a focus on Imbue's approach: optimizing models for reasoning during pre-training and developing interfaces for human-agent collaboration.

Full Transcript

[Music]

Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI.

Hey, and today in the studio we have Kanjun from Imbue. Welcome.

Thank you.

So you and I have, I guess, crossed paths a number of times. You were formerly named Generally Intelligent, and you've just announced your rename, rebrand, and a huge, humongous raise. So congrats on all of that.

Thank you.

And we're here to dive into deeper detail on Imbue. We'd like to introduce you on a high-level basis, but then have you go into a little bit more of your personal side. So you graduated with your BS and MS at MIT, and you also spent some time at the MIT Media Lab, one of the most famous, I guess, computer hacking labs in the world.

True.

Any fun stories from that time?

Yeah, I built electronic textiles, so boards that make it possible to make soft clothing. You can sew circuit boards into clothing and then make clothing electronic. It's not that useful.

You wrote a book about that?

I wrote a book about it, yeah. Basically the idea was to teach young women computer science through this route, because what we found was that young girls would be really excited about math until about sixth grade, and then they're like, "Oh, math is not good anymore, because I don't feel like the type of person who does math or does programming, but I do feel like the type of person who does crafting." So it's like, okay, what if you combine the two?

Yeah, awesome. There's always more detail to dive into on that. But then you graduated from MIT and went straight into BizOps at Dropbox, where you were eventually chief of staff, which is a pretty interesting role we can dive into later. And then it seems like the founder bug hit you. You were basically a three-time founder at Ember, Sourceress, and now Generally Intelligent/Imbue. What
should people know about you on the personal side that's not on your LinkedIn, something you're very passionate about outside of work?

Yeah, I think if you ask any of my friends, they would tell you that I'm obsessed with agency, like human agency and human potential.

That's work. Come on.

That's not work. What are you talking about?

So what's an example of human agency that you try to promote?

Yeah, with all of my friends, I have a lot of conversations that are about helping figure out what's blocking them. I guess I do this with a team kind of automatically too, and I think about it for myself often, like building systems. I have a lot of systems to help myself be more effective. At Dropbox, I used to give this onboarding talk called "How to Be Effective," which people liked. I think a thousand people heard this onboarding talk, and I think maybe Dropbox was more effective. I just really believe that as humans, we can be a lot more than we are, and it's what drives everything. I guess completely outside of work, I do dance. I do partner dance.

Nice. Yeah, lots of interest in that stuff, especially in the sort of group living houses in San Francisco, which I've been a little bit part of, and you've also run one of those.

That's right. I started The Archive with two friends, with Josh, my co-founder, and a couple of others in 2015. That's right. And GPT-3, our housemates built.

So was that, I guess, the precursor to Generally Intelligent, that you started doing more things with Josh? Is that how that relationship started?

Yeah, so Josh and I, this is our third company together. Our first company, Josh poached me from Dropbox for Ember, and there we built a really interesting technology, a laser raster projector VR headset. And then we were like, VR is not the thing we're most passionate about. And actually it was kind of early days when we both realized
like, we really do believe that in our lifetimes, computers that are intelligent are going to be able to allow us to do much more than we can do today as people, and be much more as people than we can be today. At that time, after Ember, we were like, should we work on AI research or start an AI lab? A bunch of our housemates were joining OpenAI, and we actually decided to do something more pragmatic: to apply AI to recruiting and to try to understand, okay, if we're actually trying to deploy these systems in the real world, what's required? And that was Sourceress. That was maybe an AI agent in a lot of ways, and it taught us so much. What does it actually take to make a product that people can trust and rely on? I think we never really fully got there, and it's taught me a lot about what's required. It has, I think, informed some of our approach and some of the way that we think about how these systems will actually get used by people in the real world.

Just to go one step deeper on that: you were building AI agents in 2016, before it was cool. You raised $30 million, so something was working. What do you think you succeeded in doing, and then what did you try to do that did not pan out?

Yeah, so the product worked quite well. Sourceress was an AI system that basically looked for candidates that could be a good fit and then helped you reach out to them. This was a little bit early. We didn't have language models to help you reach out, so we actually had a team of writers that customized emails, and we automated a lot of the customization. But the product was pretty magical. Candidates would just be interested and land in your inbox, and then you can talk to them as a hiring manager. That's such a good experience. I think there were a lot of learnings, both on the product and market side. On the market side, recruiting is a
market that is endogenously high churn, which means, because people start hiring, and then we hire the role for them, and they stop hiring. So the more we succeed, the more they...

It's like the whole dating business.

It's the dating business, exactly. It's exactly the same problem as the dating business. And I was really passionate about, can we help people find work that is more exciting for them? A lot of people are not excited about their jobs, and a lot of companies are doing exciting things, and the matching could be a lot better. But the dating business phenomenon put a damper on that. It's actually a pretty good business, but as with any business with relatively high churn, the bigger it gets, the more revenue we have, the slower growth becomes, because if you lose 30% of that revenue year over year, then it becomes a worse business. So that was the dynamic we noticed quite early on, after our Series A. I think the other really interesting thing about it is we realized what was required for people to trust that these candidates were well vetted and had been selected for a reason. And it's what actually led us... a lot of what we do at Imbue is working on interfaces to figure out how we get to a situation where, when you're building and using agents, these agents are trustworthy to the end user. That's actually one of the biggest issues with agents that go off and pursue longer-range goals: I have to trust, did they actually think through this situation? And that really informed a lot of our work today.

Yeah. Let's jump into GI, now Imbue. When did you decide recruiting was done for you and you were ready for the next challenge? And how did you pick the agent space? I feel like in 2021, it wasn't as mainstream.

Yeah. So the LinkedIn says that it started in 2021, but actually we started thinking very seriously about it in late 2019, early 2020. Not
exactly this idea, but in late 2019. So I mentioned our housemates, Tom Brown and Ben Mann. They're the first two authors on GPT-3. So what we were seeing is that scale is starting to work, and language models probably will actually get to a point where, with hacks, they're actually going to be quite powerful. And it was hard to see that at the time, actually, because with GPT-3, the early versions of it, there were all sorts of issues. We're like, ah, it's not that useful. But we could kind of see, okay, you keep improving it in all of these different ways, and it'll get better. So what Josh and I were really interested in is, how can we get computers that help us do bigger things? I think a lot about this: if I were born in 1900 as a woman, my life would not be that fun. I'd spend most of my time carrying water and literally getting wood to put in the stove to cook food, and cleaning and scrubbing the dishes, and getting food every day because there's no refrigerator. All of these things, very physical labor. And what's happened over the last 150 years, since the Industrial Revolution, is we've kind of gotten free energy. Energy is way more free than it was 150 years ago. And so as a result, we've built all these technologies like the stove and the dishwasher and the refrigerator, and we have electricity, and we have infrastructure, running water, all these things that have totally freed me up to do what I can do now. And I think the same thing is true for intellectual energy. We don't really see it today because we're so in it, but our computers have to be micromanaged. Part of why people are like, "Oh, you're stuck to your screen all day": well, we're stuck to our screen all day because literally nothing happens unless I'm doing something in front of my screen. I can't send my computer off to do a bunch of stuff for me. There is
a future where that's not the case, where I can actually go off and do stuff and trust that my computer will pay my bills and figure out my travel plans and do the detailed work that I am not that excited to do, so that I can be much more creative and able to do things that I as a human am very excited about, and collaborate with other people. And there are things that people are uniquely suited for. So that's kind of always been the thing that has been really exciting to me. Josh and I have known for a long time, I think, that whatever AI is, it would happen in our lifetimes. The personal computer kind of started giving us a bit of free intellectual energy, and this is really the explosion of free intellectual energy. So in early 2020, we were thinking about this, and what happened was self-supervised learning basically started working across everything. So it worked in language. SimCLR came out. I think MoCo, momentum contrast, had come out earlier in 2019; SimCLR came out in early 2020. And we're like, okay, for the first time, self-supervised learning is working really well across images and text, and we suspect that, okay, actually it's the case that machines can learn things the way that humans do. And if that's true, if they can learn things in a fully self-supervised way, because as people we are not supervised, we go Google things and try to figure things out, then what the computer could be is much bigger than what it is today. And so we started exploring ideas around that. We didn't think about the fact that we could actually just build a research lab. So we're like, okay, what kind of startup could we build to leverage self-supervised learning so that it eventually becomes something that allows computers to become much more able to do bigger things for us? But that became Generally
Intelligent, which started as a research lab.

Yeah. And so your mission is, you aim to rekindle the dream of the personal computer. So when did it go wrong, and what are your first products and user-facing things that you're building to rekindle it?

Yeah, so what we do at Imbue is we train large foundation models optimized for reasoning. And the reason for that is because reasoning is, we believe, the biggest blocker to agents, or systems that can do these larger goals. If we think about something that writes an essay: when we write an essay, we don't just output it and then we're done. We write it, and then we look at it, and we're like, "Oh, I need to do more research on that area. I'm going to go do some research and figure it out and come back. And oh, actually, the structure of the outline is not quite right, so I'm going to rearrange the outline, rewrite it." It's this very iterative process, and it requires thinking through, okay, what am I trying to do? Is the goal correct? Also, has the goal changed as I've learned more? Also, as a tool, when should I ask the user questions? I shouldn't ask them questions all the time, but I should ask them questions in higher-risk situations. How certain am I about the flight I'm about to book? There are all of these notions of risk, certainty, playing out scenarios, figuring out how to make a plan that makes sense, how to change the plan, what the goal should be, that are things we lump under the bucket of reasoning. And models today are not optimized for reasoning. It turns out that there's not actually that much explicit reasoning data on the internet, as you would expect, and so we get a lot of mileage out of optimizing our models for reasoning in pre-training. And then on top of that, we build agents ourselves. I can get into this, but we really believe in serious use, like really seriously using the systems and trying to get to an agent
that we can use every single day, tons of agents that we can use every single day. And we experiment with interfaces that help us better interact with the agents. So those are some of the things that we do on the model training and agent side. The initial agents that we build, a lot of them are trying to help us write code better, because code is most of what we do every day. And then on the infrastructure and theory side, we actually do a fair amount of theory work to understand how these systems learn, and also what the right abstractions are for us to build good agents with, which we can get more into. And if you look at our website, we build a lot of tools internally. We have a really nice automated hyperparameter optimizer. We have a lot of really nice infrastructure. And it's all part of the belief of, okay, let's try to make it so that the humans are doing the things humans are good at as much as possible. So out of our very small team, we get a lot of leverage.

And so would you still categorize yourself as a research lab now, or are you now in startup mode? Is that a transition that is conscious at all?

That's a really interesting question. I think we've always intended to try to build the next version of the computer, to enable the next version of the computer. The way I think about it is, there is a right time to bring a technology to market. Apple does this really well. Actually, the iPhone was under development for 10 years, AirPods for 5 years. And Apple has a story where, when the first multi-touch screen was created, they were like, "Oh wow, this is cool. Let's productionize iPhone." They did some work trying to productionize it and realized this is not good enough, and they put it back into research to try to figure out, how do we make it better? What are the interface pieces that are needed? And then they brought it back into
production. So I think of production and research as kind of these two separate phases, and internally we have that concept as well, where things need to be done in order to get to something that's usable, and then when it's usable, eventually we figure out how to productize it.

What's the culture like to make that happen, to have both product-oriented and research-oriented work? And as you think about building the team, I mean, you just raised $200 million, I'm sure you want to hire more people. What are the right archetypes of people that work at Imbue?

Yeah, I would say we have a very unique culture in a lot of ways. I think a lot about social process design: how do you design social processes that enable people to be effective? I like to think about team members as creative agents. Most companies think of their people as assets, and they're very proud of this. And I think, okay, what is an asset? It's something you own that provides you value, that you can discard at any time. This is a very low bar for people. This is not what people are. And so we try to enable everyone to be a creative agent and to really unlock their superpowers. I was mentioning earlier that I'm obsessed with agency. A lot of the work I do with team members is to try to figure out, what are you really good at? What really gives you energy? Where can we put you, and how can I help you unlock that and grow that? In terms of team structure, much of our work actually comes from people. CARBS, our hyperparameter optimizer, came from Abe trying to automate his own research process doing hyperparameter optimization, and he actually pulled some ideas from plasma physics (he's a plasma physicist) to make the local search work. A lot of our work on evaluations comes from a couple of members of our team who are obsessed with
evaluations. We do a lot of work trying to figure out, how do you actually evaluate if the model is getting better? Is the model making better agents? Is the agent actually reliable? And so I think of people as making the them-shaped blob inside Imbue, and I think that's the kind of person that we're hiring for. We're hiring product engineers and data engineers and research engineers and all these roles. We have projects, not teams. We have a project around data collection and data engineering. That's actually one of the key things that improve the model performance. We have a pre-training project, with some fine-tuning as part of that. And then we have an agents project that's trying to build on top of our models, as well as use other models in the outside world, to try to make agents that we then actually use as programmers every day. So all sorts of different projects.

As a founder, you're now sort of a capital allocator among all these different investments, effectively, in different projects. And I was interested in how you mentioned that you're optimizing for improving reasoning specifically inside of your pre-training, which I assume is just a lot of data collection.

We are optimizing reasoning inside of our pre-trained models, and a lot of that is about data, and I can talk more about what exactly it involves. But actually, maybe 50% plus of the work is figuring out, even if you do have models that reason well, the models are still stochastic. The way you prompt them is still kind of random, it makes them do random things. And so how do we get to something that is actually robust and reliable as a user? How can I, as a user, trust it? I was mentioning earlier, when I talk to other people building agents, they have to do so much work to try to get to something
that they can actually productize, and it takes a long time. Agents haven't been productized yet partly for this reason: the abstractions are very leaky. We can get 80% of the way there, but like self-driving cars, the remaining 20% is actually really difficult. We believe that, and we have internally, I think, some things, like an interface, for example, that lets me really easily see what the agent execution is, fork it, try out different things, modify the prompt, modify the plan that it is making. This type of interface makes it so that I feel more like I'm collaborating with the agent as it's executing, as opposed to it just doing something as a black box. That's an example of a type of thing that's beyond just the model pre-training. But on the model pre-training side, reasoning is a thing that we optimize for, and a lot of that is about, yeah, what data do we put in.

Yeah, it's interesting, just because I always think, out of the levers that you have, the resources that you have, I think a lot of people think that running a foundation model company or a research lab is going to be primarily compute. And I think the share of compute has gone down a lot over the past three years. It used to be the main story, like the main way you scale is you just throw more compute at it. And now it's like, FLOPs is not all you need. You need better data, you need better algorithms. And I wonder where that shift has gone. This is a very vague question, but is it like 30/30/30 now? Is it maybe even higher?

So one way I'll put this is, people estimate that Llama 2 maybe took about $3-4 million of compute, but probably $20-25 million worth of labeling data. And I'm like, okay, well, that's a very different story than all these other foundation model labs raising hundreds of millions of dollars and spending it on GPUs. Data is really expensive.
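Taking the Llama 2 figures quoted just above at face value (they are secondhand estimates relayed in conversation, not confirmed numbers), a quick back-of-the-envelope calculation shows how lopsided the split would be. The function name and the two-bucket breakdown are illustrative assumptions; the sketch ignores every other cost, such as salaries or research on algorithms.

```python
def data_share(compute_musd: float, data_musd: float) -> float:
    """Fraction of (compute + data) spend that goes to data labeling."""
    return data_musd / (compute_musd + data_musd)


# Estimates as quoted in the conversation, in millions of USD (assumed ranges).
COMPUTE_RANGE = (3.0, 4.0)
DATA_RANGE = (20.0, 25.0)

# Lowest data share: most compute, least data. Highest: the reverse.
low = data_share(COMPUTE_RANGE[1], DATA_RANGE[0])
high = data_share(COMPUTE_RANGE[0], DATA_RANGE[1])

print(f"Data labeling share of spend: {low:.0%} to {high:.0%}")
```

Under these assumptions, labeling would account for roughly five-sixths or more of the combined compute-plus-data budget, well beyond an even split.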
We generate a lot of data, and so that does help. The generated data is close to as good as human-labeled data.

So generated data from other models?

From our own models or other models, yeah.

Do you feel like... there are certain variations of this. There's the sort of Constitutional AI approach from Anthropic, and basically models sampling, training on data from other models. I feel like there's a little bit of contamination in there, or, to put it in a statistical form, you're resampling a distribution that you already have, that you already know doesn't match human distributions. How do you feel about that, basically, just philosophically?

So when we're optimizing models for reasoning, we are actually trying to make a part of the distribution really spiky. So in a sense, that's actually what we want, because the internet is a sample of the human distribution that's also skewed in all sorts of ways. That is not the data that we necessarily want these models to be trained on, and so I don't worry about it that much. What we've seen so far is that it seems to help. When we're generating data, we're not really randomly generating data. We generate very specific things that are reasoning traces and that help optimize reasoning. Code also is a big piece of improving reasoning. Generated code is not that much worse than regular human-written code. You might even say it can be better in a lot of ways. So yeah, we are already trying to do that.

What are some of the tools that you saw that you thought were not a good fit? So you built Avalon, which is your own simulated world. And when you first started, the metagame was using games to simulate things, using Minecraft, and then OpenAI's Gym thing, and all these things. And I think in one of your other podcasts you mentioned Minecraft
is way too slow to actually do any serious work. What led you to...

Yeah, I didn't say it.

I don't know, [unclear]. But Avalon is like 100 times faster than Minecraft for simulation. When did you figure out that you needed to just build your own thing? Was it that your engineering team was like, "Hey, this is too slow"? Was it more a long-term investment?

At that time, we built Avalon as a research environment to help us learn particular things. And one thing we were trying to learn is, how do you get an agent that is able to do many different tasks? With RL agents and environments at that time, what we heard from other RL researchers was that the biggest thing holding the field back is lack of benchmarks that let us explore things like planning and curiosity, and have the agent actually perform better if the agent has curiosity. And so we were trying to figure out, okay, how can we have agents that are able to handle lots of different types of tasks without the reward being pretty handcrafted? That's a lot of what we had seen, these very handcrafted rewards. And so Avalon has a single reward across all tasks. It also allowed us to create a curriculum, so we could make the level more or less difficult. And it taught us a lot, maybe two primary things. One is, with no curriculum, RL algorithms don't work at all. So that's actually really interesting.

For the non-RL specialist, what is a curriculum in your terminology?

So a curriculum in this particular case is basically: the environment, Avalon, lets us generate simpler environments and harder environments for a given task. What's interesting is that in the simpler environments, what you expect is the agent succeeds more often, so it gets more reward. And so my intuitive way of thinking
about it is, okay, the reason why it learns much faster with a curriculum is it's just getting a lot more signal. And that's actually an interesting general intuition to have about training these things: what kind of signal are they getting, and how can you help it get a lot more signal? The second thing we learned is that reinforcement learning is not a good vehicle, pure reinforcement learning is not a good vehicle, for planning and reasoning. These agents were able to learn all sorts of crazy things. They could learn to climb hand over hand, like in VR climbing. They could learn to open doors, very complicated, like multiple switches and a lever to open the door. But they couldn't do any higher-level things, and they couldn't do those lower-level things consistently, necessarily. And as a user, we were like, okay, I do not want to interact with a pure reinforcement learning, end-to-end RL agent. As a user, I need much more control over what that agent is doing. And so that actually started to get us on the track of thinking about, okay, how do we do the reasoning part in language? We were pretty inspired by our friend Chelsea Finn at Stanford, who was, I think, working on SayCan at the time, which is basically an experiment where they have robots trying to do different tasks, and they actually do the reasoning for the robot in natural language, and it worked quite well. And that led us to start experimenting very seriously with reasoning.

How important is the language part for the agent versus for you to inspect the agent? Is the interface to the human in the loop really important, or...?

Yeah, I personally think of it as much more important for us, the human user. So I think you probably could get end-to-end agents that work and are fairly general at some point in the future, but I think you don't want that. We actually
want agents that we can perturb while they're trying to figure out what to do. Even a very simple example: internally we have a type error fixing agent, and we have a test generation agent. The test generation agent goes off the rails all the time, and I want to know, why did it generate this particular test? What was it thinking? Did it consider the fact that this is calling out to this other function? Like a formatter agent: if it ever comes up with anything weird, I want to be able to debug what happened. With RL end-to-end stuff, we couldn't do that.

So it sounds like you have a bunch of agents operating internally within the company. What's your most successful agent, I guess, and what's your least successful one?

Yeah, a type of agent that works moderately well is like, fix the color of this button on the website, or change the color.

Now sweep.dev is doing that.

Exactly. Perfect. Okay, well, we should just use sweep.dev.

Well, I mean, okay, I don't know, how often do you have to fix the color of a button, right? Because all of them raise money on the idea that they can go further. And my fear when encountering something like that is that there's some kind of unknown asymptote, a ceiling, that they're going to run head-on into, that you've already run into.

We've definitely run into such a ceiling.

What is the ceiling? Is there a name for it?

I mean, for us, we think of it as reasoning plus these tools, so reasoning plus abstractions, basically. I think actually you can get really far with current models, and that's why it's so compelling. We can pile debugging tools on top of these current models, have them critique each other and critique themselves, and do all of these, you know, spend more compute at inference time, context hacks, retrieval-augmented generation, et cetera. The pile of hacks actually does get us really far, and it's kind of like trying to get more
signal out of the channel. We don't like to think about it that way. It's what the default approach is, trying to get more signal out of this noisy channel. But the issue with agents is, as a user, I want it to be mostly reliable. It's kind of like self-driving in that way. It's not as bad as self-driving. In self-driving, you're hurtling at 70 miles an hour; it's like the hardest agent problem. But I think one thing we learned from Sourceress, and one thing we've learned by using these things internally, is we actually have a pretty high bar for these agents to work. It is actually really annoying if they only work 50% of the time, and we can make interfaces to make it slightly less annoying. But yeah, there's a ceiling that we've encountered so far, and we need to make the models better, and we also need to make the interface to the user better, and also a lot of the critiquing. We have a lot of generation methods, kind of like spending compute at inference time, that help things be more robust and reliable, but it's still not 100% of the way there. So to your question of what agents work well and what doesn't work well: most of the agents don't work well, and we're slowly making them work better by improving the underlying model and improving these.

I think that's comforting for a lot of people who are feeling a lot of impostor syndrome, not being able to make it work. And the fact that you share their struggles, I think, also helps people understand how early this is.

Yeah, definitely. It's very early, and I hope what we can do is help people who are building agents actually be able to deploy them. I think that's the gap that we see a lot of today: for everyone who's trying to build agents, getting to the point where it's robust enough to be deployable is an unknown amount of time.
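The "have them critique each other" and "spend more compute at inference time" ideas from this stretch of the conversation can be illustrated with a minimal best-of-n sketch. It is not Imbue's implementation: `best_of_n`, `success_with_retries`, and the toy generator and critic below are hypothetical stand-ins for real model calls, and the probability formula assumes independent attempts and a critic that always recognizes a correct answer.

```python
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[int], str],
    critique: Callable[[str], float],
    n: int,
) -> Tuple[str, float]:
    """Sample n candidates and keep the one the critic scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scored: List[Tuple[str, float]] = []
    for attempt in range(n):
        candidate = generate(attempt)
        scored.append((candidate, critique(candidate)))
    # max() returns the first candidate with the top score on ties.
    return max(scored, key=lambda pair: pair[1])


def success_with_retries(p_single: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds.

    Assumes a perfect critic; a real critic would lower this number.
    """
    return 1.0 - (1.0 - p_single) ** n


# Hypothetical stand-in for a test-generation model: cycles through drafts.
def toy_generate(attempt: int) -> str:
    drafts = [
        "assert add(2, 2) == 5",
        "assert add(2, 2) == 4",
        "assert add(2, 2)",
    ]
    return drafts[attempt % len(drafts)]


# Hypothetical stand-in for a critic model: rewards the correct assertion.
def toy_critique(candidate: str) -> float:
    return 1.0 if candidate.endswith("== 4") else 0.0


best, score = best_of_n(toy_generate, toy_critique, n=3)
print(best, score)
print(success_with_retries(0.5, 4))
```

With a single attempt the toy generator returns a wrong test; with three attempts the critic can pick the correct one. Under the stated assumptions, an agent that works 50% of the time would succeed in at least one of four tries about 94% of the time, which matches the point made above that such methods help while still falling short of full reliability.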
Well, so this goes back to what Imbue is going to offer as a product or a platform. How are you going to actually help people deploy those agents?

Yeah, so our current hypothesis, and I don't know if this is actually going to end up being the case: we built a lot of tools for ourselves internally around debugging, around abstractions or techniques after the model generation happens, like after the language model generates the text, interfaces for the user, and the underlying model itself, like models talking to each other. Maybe some set of those things, kind of like an operating system, will be helpful for other people. And we'll figure out what set of those things is helpful for us to make our agents. What we want to do is get to a point where we can start making an agent, deploy it, and it's reliable, very quickly. There's a similar analog to software engineering. In the early days, in the 60s and the 70s, to program a computer you had to go all the way down to the registers. Eventually we had assembly, and that was an improvement. Then we wrote programming languages with these higher levels of abstraction, and that allowed a lot more people to do this, and much faster, and the software created is much less expensive. I think it's basically a similar route here, where we're in the bare-metal phase of agent building, and we'll eventually get to something with much nicer abstractions.

So you touched a little bit on the data before. We had this conversation with George Hotz, and we were like, there's not a lot of reasoning data out there, and can the models really understand? And his take was like, look, with enough compute, you're not that complicated as a human; the model can figure out eventually why certain decisions are made. What's been your experience? As you think about reasoning data, do you have to do a lot of manual work, or is there a way to
prompt models to extract the reasoning from actions that they see?

We don't think of it as, oh, throw enough data at it and then it will figure out what the plan should be. I think we're much more explicit. We have a lot of thoughts internally, like many documents, about what reasoning is. A way to think about it is, as humans, we've learned a lot of reasoning strategies over time. We are better at reasoning now than we were 3,000 years ago. An example of a reasoning strategy is noticing you're confused. When I notice I'm confused, I should ask: what was the original claim that was made? What evidence is there for this claim? Does the evidence support the claim? Is the claim correct? This is a reasoning strategy that was developed in, like, the 1600s, with the advent of science. That's an example of a reasoning strategy. There are tons of them. We employ them all the time, lots of heuristics that help us be better at reasoning, and we didn't always have them. And because they're invented, we can generate data that's much more specific to them. So internally, yeah, we have a lot of thoughts on what reasoning is, and we generate a lot more specific data. We're not just like, oh, it'll figure out reasoning from this black box, or it'll figure out reasoning from the data that exists.

Yeah, I mean, the scientific method is a good example. And if you think about hallucination, right, people are thinking, how do we use these models to do net new scientific research? And if you go back in time and the model is like, well, the Earth revolves around the Sun, people are like, man, this model is crap. What are you talking about? The Sun revolves around the Earth. How do you see that future? If the models are actually good enough but we don't believe them, how do we make the two live
together? Say you use Imbue as a scientist to do a lot of your research, and Imbue tells you, hey, I think this is a serious path you should go down, and you're like, no, this sounds impossible. How is that trust going to be built, and what are some of the tools that maybe are going to be there to inspect it?

Yeah, so one element of it is, as a person, I need to basically get information out of the model about what's with the model. So then the second question is, okay, how do you do that? And that's kind of some of our debugging tools. They're not necessarily just for debugging; they're also for interfacing with and interacting with the model. So if I go back in this reasoning trace and change a bunch of things, what's going to happen? What does it conclude instead? That kind of helps me understand what its assumptions are. And we think of these things as tools, so it's really about, as a user, how do I use this tool effectively? I need to be willing to be convinced as well. It's like, how do I use this tool effectively, what can it help me with, and what can it tell me?

So there's a lot of mention of code in your process, and I was hoping to dive in even deeper. I think we might run the risk of giving people the impression that you view code, or you use code, just as a tool within Imbue, just for coding assistance. And I think there's a lot of informal understanding about how adding code to language models improves their reasoning capabilities. I wonder if there's any research or findings that you have to share that talk about the intersection of code and reasoning.

Yeah, so the way I think about it intuitively is, code is the most explicit example of reasoning data on the internet. And it's not only structured, it's actually very explicit, which is nice. It says this variable means this, and then it uses
this variable, and then the function does this. As people, when we talk in language, it takes a lot more to extract that explicit structure out of our language. So that's one thing that's really nice about code: I see it as almost like a curriculum for reasoning. I think we use code in all sorts of ways. The coding agents are really helpful for us to understand what the limitations of the agents are. The code is really helpful for the reasoning itself. But also, code is a way for models to act. By generating code, it can act on my computer. And when we talk about rekindling the dream of the personal computer, kind of where I see computers going is, computers will eventually become these much more malleable things. I as a user today have to know how to write software code in order to make my computer do exactly what I want it to do, but in the future, if the computer is able to generate its own code, then I can actually interface with it in natural language. And so one way we think about agents is as kind of a natural-language programming language. It's a way to program my computer in natural language that's much more intuitive to me as a user. And these interfaces that we're building are essentially IDEs for users to program our computers in natural language.

What do you think about the different approaches people have, kind of like text-first, browser-first, like MultiOn? What do you think the best interface will be, or what is your thinking today?

I think chat is very limited as an interface. It is sequential, where these agents don't have to be sequential. So with a chat interface, if the agent does something wrong, I have to figure out how to get it to go back and start from the place I wanted it to start from. So in a lot of ways, chat as an interface... I think Linus Lee, who you had on this, I really like how he put it: chat
as an interface is skeuomorphic. In the early days, when we made word processors on our computers, they had notepad lines, because that's what we understood these objects to be. Chat, like texting someone, is something we understand, so texting our AI is something that we understand. But today's Word documents don't have notepad lines. And similarly, chat is a very primitive way of interacting with agents. What we want is to be able to inspect their state, and to be able to modify them and fork them, and all of these other things. And we internally think about what the right representations are for that, like architecturally: what are the right representations, what kind of abstractions do we need to build, and how do we build abstractions that are not leaky? Because if the abstractions are leaky, which they are today (this stochastic generation of text is a leaky abstraction, I cannot depend on it), that means it's actually really hard to build on top of. But our experience and belief is that by building better abstractions and better tooling, we can actually make these things non-leaky, and now you can build whole things on top of them. So these other interfaces, because of where we are, we don't think that much about them.

Cool. Yeah, I mean, you mentioned this is kind of like the Xerox PARC moment for AI. And we had a lot of stuff come out of PARC, like the what-you-see-is-what-you-get editors and MVC and all this stuff. But then we didn't have the iPhone at PARC; we didn't have all these higher things. What do you think it's reasonable to expect in this era of AI, kind of five years or so? What are the things we'll build today, and what are things that maybe we'll see in the second wave of products?

I think the waves will be much faster than before. What we're seeing right now is basically like
a continuous wave. Let me zoom a little bit earlier. So people like the Xerox PARC analogy I give, but I think there are many different analogies. One is the analog-to-digital computer; that's another analogy to where we are today. The analog computer, Vannevar Bush built in the 1930s, I think, and it's like a system of pulleys, and it can only calculate one function, like it can calculate an integral. And that was so essential at the time, because you actually did need to calculate this integral a bunch. But it had a bunch of issues, like in analog, errors compound. And so there was actually a set of breakthroughs necessary in order to get to the digital computer, like Turing's decidability, and Shannon, I think, with the whole idea that relay circuits can be mapped to Boolean operators, and a set of other theoretical breakthroughs, which essentially were creating abstractions for these very analog circuits. And digital had this nice property of being error-correcting. So when I talk about less leaky abstractions, that's what I mean; that's what I'm pointing a little bit to. It's not going to look exactly the same way. And then the Xerox PARC piece: a lot of that is about how do we get to computers that, as a person, I can actually use well, and the interface actually helps it unlock so much more power. So the sets of things we're working on, the sets of abstractions and the interfaces, hopefully that helps us unlock a lot more power in these systems. Hopefully that'll come not too far in the future. I could see a next version, maybe a little bit farther out; it's like an agent protocol, so a way for different agents to talk to each other and call each other, kind of like HTTP.

Do you know it exists already?

Yeah, there is a nonprofit that's working on one. I think it's a bit early, but it's interesting to think about right now. Part of why I think it's early is because the issue with
agents is, it's not quite like the internet, where you could make a website and the website would appear. The issue with agents is that they don't work. And so it may be a bit early to figure out what the protocol is before we really understand how these agents get constructed. But I think it's a really interesting question.

While we're talking on this agent-to-agent thing, there's been a bit of research recently on some of these approaches. I tend to just call them extremely complicated chain-of-thoughting, but any perspectives on MetaGPT? I think it's the name of the paper. I don't know if you care at the level of individual papers coming out, but I did read that recently, and TL;DR, it beat GPT-4 on HumanEval by role-playing a software development agency. Instead of having a single shot, a single role, you have multiple roles, and you have all of them criticize each other as agents communicating with other agents.

Yeah, I think this is an example of an interesting abstraction of, okay, can I just plop in this multi-role critiquing and see how it improves my agent? Can I just plop in chain of thought, tree of thought, plop in these other things and see how they improve my agent? One issue with this kind of prompting is that it's still not very reliable. There's one lens, which is, okay, if you do enough of these techniques, you'll get to high reliability, and I think actually that's a pretty reasonable lens; we take that lens often. And then there's another lens that's like, okay, but it's starting to get really messy what's in the prompt, and how do we deal with that messiness? And so maybe you need cleaner ways of thinking about and constructing these systems, and we also take that lens. So yeah, I think both are necessary.

Side question, because I feel like this also brought up another question I had for you. I noticed that
you work a lot with your own benchmarks, your own evaluations of what is valuable. And I would contrast your approach with OpenAI, as OpenAI tends to just lean on, hey, we played StarCraft, or hey, we ran it on the SAT or the AP Bio test, and it got these results, basically. Is benchmark culture ruining AI, or is that actually a good thing, because everyone knows what an SAT is and that's fine?

I think it's important to use both public and internal benchmarks. Part of why we build our own benchmarks is that there are not very many good benchmarks for agents, actually, and to evaluate these things, we actually need to think about it in a slightly different way. But we also do use a lot of public benchmarks for, like, is the reasoning capability in this particular way improving? So yeah, it's good to use both.

So for example, the Voyager paper coming out of Nvidia played Minecraft and set their own benchmarks on getting the diamond axe or whatever, and exploring as much of the territory as possible. And I don't know how that's received. That's obviously fun and novel for the rest of the engineers, people who are new to the scene, but for people like yourself, you built Avalon just because you already found deficiencies with using Minecraft. Is that valuable as an approach?

Oh yeah, I love Voyager. I mean, Jim, I think, is awesome, and I really like the Voyager paper, and I think it has a lot of really interesting ideas, which is like the agent can create tools for itself and then use those tools.

And he had the idea of the curriculum as well, which is something that we, earlier...

Exactly, and that's a lot of what we do. We built Avalon mostly because we couldn't use Minecraft very well to learn the things we wanted, and so it's not that much work to build our own. It took us, I don't know, we had like eight
engineers at the time, took about eight weeks, so, six weeks.

Nice. And OpenAI built their own as well, right?

Yeah, exactly. It's just nice to have control over our environment, our own sandbox, to really try to inspect our own research questions. But if you're doing something like experimenting with agents and trying to get them to do things, Minecraft is a really interesting environment. And so Voyager has a lot of really interesting ideas in it.

Yeah, cool. One more element that we had on this list, which was context and memory. I think that's kind of like the foundational, quote-unquote, RAM of our era. I think Andrej Karpathy has already made this comparison, so there's nothing new here. But that's just the amount of working knowledge that we could fit into one of these agents, and it's not a lot, right? Especially if you need to get them to do long-running tasks, if they need to self-correct from errors that they observe while operating in their environment. Do you see this as a problem? Do you think we're going to just trend to infinite context and that'll go away, or how do you think we're going to deal with it?

When you talked about what's going to happen in the first wave and then in the second wave, I think what we'll see is we'll get relatively simplistic agents pretty soon, and they will get more and more complex. And there's a future wave in which they are able to do these really difficult, really long-running tasks, and the blocker to that future, one of the blockers, is memory. And that was true of computers too. I think when von Neumann made the von Neumann architecture, he was like, the biggest blocker will be memory; we need this amount of memory, which is like, I don't remember exactly, 32 kilobytes or something, to store programs, and that will allow us to write software. He didn't say it this way, because he didn't
have these terms. And then that only really happened in the
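The memory constraint discussed above, where the context is treated as a small working memory, can be made concrete with a rough sketch of a fixed token budget that drops the oldest messages first. Everything here is hypothetical and for illustration only: `ContextBuffer` is not an API from any library, and `count_tokens` is a crude word count standing in for a real tokenizer.

```python
from collections import deque
from typing import Deque, List


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


class ContextBuffer:
    """Keeps the most recent messages that fit inside a fixed token budget."""

    def __init__(self, max_tokens: int) -> None:
        if max_tokens < 1:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens
        self._messages: Deque[str] = deque()
        self._used = 0
        self.evicted: List[str] = []  # what the agent can no longer see

    def add(self, message: str) -> None:
        cost = count_tokens(message)
        if cost > self.max_tokens:
            raise ValueError("message alone exceeds the budget")
        self._messages.append(message)
        self._used += cost
        # Evict oldest messages until the window fits the budget again.
        while self._used > self.max_tokens:
            old = self._messages.popleft()
            self._used -= count_tokens(old)
            self.evicted.append(old)

    def window(self) -> List[str]:
        """Messages currently visible to the agent, oldest first."""
        return list(self._messages)


if __name__ == "__main__":
    buf = ContextBuffer(max_tokens=6)
    for msg in ["plan the task", "ran step one", "step one failed"]:
        buf.add(msg)
    print(buf.window())
    print(buf.evicted)
```

In this toy run the original plan is the first thing to fall out of the window, which is one way to picture why long-running, self-correcting tasks are hard with a small context.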

Original Description

Last month, Imbue was crowned as AI’s newest unicorn foundation model lab, raising a $200m Series B at a $1+ billion valuation. As “stealth” foundation model companies go, Imbue (f.k.a. Generally Intelligent) has stood as an enigmatic group given they have no publicly released models to try out. However, ever since their $20m Series A last year their goal has been to “develop generally capable AI agents with human-like intelligence in order to solve problems in the real world”. Kanjun Qiu joined us to share more of their story.

Full show notes: https://www.latent.space/p/imbue

00:00 - Introductions
07:13 - The origin story of Imbue
11:26 - Imbue's approach to training large foundation models optimized for reasoning
14:20 - Imbue's goals to build an "operating system" for reliable, inspectable AI agents
17:51 - Imbue's process of developing internal tools and interfaces to collaborate with AI agents
19:47 - Imbue's focus on improving reasoning capabilities in models, using code and other data
21:33 - The value of using both public benchmarks and internal metrics to evaluate progress
21:43 - Lessons learned from developing the Avalon research environment
23:31 - The limitations of pure reinforcement learning for general intelligence
32:12 - Imbue's vision for building better abstractions and interfaces for reliable agents
33:49 - Interface design for collaborating with, rather than just communicating with, AI agents
39:51 - The future potential of an agent-to-agent protocol
42:53 - Leveraging approaches like critiquing between models and chain of thought
47:30 - Kanjun's philosophy on enabling team members as creative agents at Imbue
59:54 - Kanjun's experience co-founding the communal co-living space The Archive
01:00:22 - Lightning Round



This video discusses the current state of AI agents, their limitations, and the need for debugging tools and improved interfaces for human-agent collaboration. Kanjun Qiu shares Imbue's approach to optimizing reasoning in pre-training models and developing interfaces for human-agent collaboration.

Key Takeaways

Identify the limitations of current AI agents
Develop debugging tools for inspecting and interacting with models
Design interfaces for human-agent collaboration
Apply reinforcement learning and curriculum learning to improve agent performance
Develop multi-agent systems and agent communication protocols

💡 The development of AI agents requires a focus on optimizing reasoning, improving interfaces for human-agent collaboration, and developing debugging tools to inspect and interact with models.


