Building AGI with OpenAI's Structured Outputs API
Key Takeaways
This video discusses building AGI with OpenAI's Structured Outputs API, covering its features, use cases, and applications, including function calling, structured outputs, and model refinement.
Full Transcript
Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-residence at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI.
Hey, and today we're excited to be in the in-person studio with Michelle. Welcome.
Thanks, thanks for having me. Very excited to be here.
This has been a long time coming. I've been following your work on the API platform for a little bit, and I'm finally glad that we could make this happen after you shipped structured outputs. How does that feel?
Yeah, it feels great. We've been working on it for quite a while, so very excited to have it out there and have people using it.
We'll tell the story soon, but I want to give people a little intro to your background. So you've interned and/or worked at Google, Stripe, Coinbase, Clubhouse, and obviously OpenAI. What was that journey like? The one that has most appealed to me is Clubhouse, because that was a very, very hot company for a while. You seem to join companies when they're about to scale up a lot, and obviously OpenAI has been the latest. What are your learnings and your history going into all these notable companies?
Yeah, totally. For a bit of my background, I'm Canadian. I went to the University of Waterloo, and there you do like six internships as part of your degree. My first job was actually really rough. I worked at a bank, and I learned Visual Basic, and I animated bond yield curves, and it was, you know, not...
Me too.
Oh, really?
Yeah, I was a derivatives trader. Interest rate swaps, that kind of stuff.
Yeah. So I liked having a job, but I didn't love that job. And then my next internship was Google, and I learned so much there. It was tremendous. But I had a bunch of friends that were into startups more, and Waterloo has a big startup culture, and one of my friends interned at Stripe and he said it was super cool. So that was kind of my... I also was a
little bit into crypto at the time, and I got into it on Hacker News, so Coinbase was on my radar. And so that was my first real startup opportunity, Coinbase. I think I've never learned more in my life than in the four-month period when I was interning at Coinbase. They actually put me on call. I worked on the ACH rails there, and it was absolutely crazy. Crypto was a very formative experience.
And this is 2018 to 2020, kind of like the first...
That was my full-time, but I was there as an intern in 2016. And so that was the period where I really learned to become an engineer: learned how to use Git, got on call right away, managed production databases and stuff. So that was super cool. After that I went to Stripe and got a different flavor of payments on the other side. Learned a lot, was really inspired by the Collisons. And then for my next internship after that, I actually started a company at Waterloo. There's this thing you can do, it's an entrepreneurship co-op, and I did it with my roommate. The company's called Readwise, which still exists, but...
Oh, yeah, yeah, Readwise. Wait, what? Awesome. You co-founded Readwise?
Yeah.
I'm a premium user. It's not even on your LinkedIn.
Yeah, I mean, I only worked on it for about a year, and so Tristan and Dan are the real founders, and I just had an interlude there. But I really loved working on something very startup-focused, user-focused, and hacking with friends. It was super fun. Eventually I decided to go back to Coinbase and really get a lot better as an engineer. I didn't feel equipped to be a CTO of anything at that point, and so I just learned so much at Coinbase, and that was a really fun curve. But yeah, after that I went to Clubhouse, which was a really interesting time. I wouldn't say that I went there before it blew up; I would say I went there as it blew up. So not quite the sterling track record that it might seem, but it was a super
exciting place. I joined as the second or third backend engineer, and we were down every day, basically. One time Oprah came on and absolutely everything melted down. And so we would have a stand-up every morning and be like, how do we make everything stay up? Which is super exciting. Also, one of the first things I worked on there was making our notifications go out more quickly, because when you join a Clubhouse room, you need everyone to come in right away so that it's exciting and the person speaking thinks "a lot of my audience is here." But when I first joined, I think it would take like 10 minutes for all the notifications to go out, which is insane. From the time you want to start talking to the time your audience is there, you can totally kill the room. So that's one of the first things I worked on: making that a lot faster and keeping everything up.
I mean, we have an audience of engineers, so those two things are useful: keeping things up and getting notifications out. Notifications, is it a Kafka topic?
It was a Postgres shop, and you had all of the followers in Postgres, and you needed to iterate over the followers and figure out, is this a good notification to send? And all of this logic wasn't well batched and parallelized, and our job queuing infrastructure wasn't right. So there was a lot of fixing all of these things. Eventually there were a lot of database migrations, because Postgres just wasn't scaling well for us.
Interesting. And then keeping things up, that was more of a, I don't know, reliability issue, SRE-type work?
A lot of it, yeah. It goes down to database stuff. Everywhere I've worked, at Coinbase, at Clubhouse, and at OpenAI, Postgres has been a perennial challenge. The stuff you learn at one job carries over to all the others, because you're always debugging a long-running Postgres query at 3:00 a.m.
for some reason. So those skills have really carried me forward, for sure.
Why do you think that not as much of this is productized? Obviously Postgres is an open source project, it's not aimed at gigascale, but you would think somebody would come around and say, hey, we're like the...
Yeah, I think that's what PlanetScale is doing, kind of. It's not on Postgres, I think it's on MySQL, but I think that's the vision. They have zero-downtime migrations, and that's a big pain point. I don't know why no one is doing this on Postgres, but I think it would be pretty cool.
Their connection poolers, like PgBouncer, are good enough, I don't know.
Yeah, well, I've run PgBouncer everywhere, and there's still a lot of problems.
Your scale is something that not many people see.
Yeah. I mean, at some point every successful company gets to the scale where Postgres is not cutting it, and then you migrate to some sort of NoSQL database. And that process I've seen happen a bunch of times now.
MongoDB, Redis, something like that?
Yeah. I mean, we're on Azure now, and so we use Cosmos DB.
Cosmos DB, hey.
At Clubhouse... I really love DynamoDB. That's probably my favorite database, which is a very nerdy sentence, but that's the one I'm using if I need to scale something as far as it goes.
Yeah, DynamoDB. When I worked at AWS briefly, I learned it's kind of like the memory register for the web. If you treat it just as physical memory, you will use it well. If you treat it as a real database, you'll run into problems.
Right, you have to totally change your mindset when you're going from Postgres to Dynamo, but I think it's a good mindset shift, and it kind of makes you design things in a more scalable way.
Yeah, I'll recommend a DynamoDB book for people who need to use DynamoDB. But we're not here to talk about AWS, we're here to talk about OpenAI. You joined OpenAI pre-ChatGPT. I also had the opportunity to join and I didn't.
What was your insight?
Yeah, I think a lot of people who joined OpenAI joined because of a product that really gets them excited, and for most people it's ChatGPT. But for me, I was a daily user of Copilot, GitHub Copilot, and I was so blown away at the quality of this thing. I actually remember the first time seeing it on Hacker News and being like, wow, this is absolutely crazy, this is going to change everything. And I started using it every day. Even now, when I don't have service and I'm coding without Copilot, it's just like a 10x difference. So I was really excited about that product. I thought now is maybe the time for AI, and I'd done some AI in college and thought some of those skills would transfer. I got introduced to the team, I liked everyone I talked to, so I thought it would be cool. Why didn't you join?
I was like, is DALL-E it?
We were there, we were at the DALL-E launch thing, and I think you were talking with Lenny, and Lenny was at OpenAI at the time, and you were like... We don't have to go into too much detail.
But this is one of my biggest regrets of my life. I was like, okay, I can create images, I don't know if this is the thing to dedicate... But obviously you had a bigger vision than I did.
It was really cool, too. I remember first showing my family. I was like, I'm going to this company, and here's one of the things they do, and it really helped bridge the gap. Whereas I still haven't figured out how to explain to my parents what crypto is. My mom for a while thought I worked at Bitcoin. So it's pretty different to be able to tell your family what you actually do, and they can see it.
Yeah, and they can use it too, personally. So were you immediately on API platform? You were there for the ChatGPT moment?
Yeah, I mean, API platform is a very grandiose term for what it was. There was just a handful of us
working on the API.
Yeah, it was like a closed beta, right? Not even everyone had access to GPT-3.
A very different access model then, a lot more tiered rollouts. But yeah, I would say the applied team was maybe like 30 or 40 people, probably closer to 30, and there was maybe like five-ish total working on the API at most. So yeah, we've grown a lot since then.
It's like 60, 70 now, right?
No, applied is much bigger than that. Applied now is bigger than the company when I joined.
Okay.
Yeah, we've grown a lot. I mean, there's so much to build, so we need all...
I'm a little out of date.
Any ChatGPT release, all-hands-on-deck stories? I had lunch with Evan Morikawa a few months ago, and it sounded like it was a fun time to build the APIs and have all these people trying to use the web thing. How were you prioritizing internally? What was it like scaling non-GPU workloads, versus Postgres bouncers and things like that?
Yeah, actually, surprisingly, there were a lot of Postgres issues when ChatGPT came out, because the accounts for ChatGPT were tied to the accounts in the API. And so you were basically creating a developer account to log into ChatGPT at the time, because it's just what we had. It was a low-key research preview. And so I remember there was just so much work scaling our authorization system, and that would be down a lot. Also GPUs. I had never worked in a place where you couldn't just scale the thing up. Everywhere I've worked, compute is like free, and you just autoscale a thing and never think about it again. But here we're having tough decisions every day. We're discussing, should they go here or here? And we have to be principled about it. So that's a real mindset shift.
So you just released structured outputs, congrats. You also wrote the blog post for it, which was really well written, and I love all the examples that you put out. It really gives the full
story. Yeah, tell us about the whole story from beginning to end.
Yeah, I guess for the story we should rewind quite a bit, to Dev Day last year.
Dev Day last year.
Exactly. We shipped JSON mode, which is our first foray into this area of product. For folks who don't know, JSON mode is this functionality you can enable in our chat completions and other APIs where, if you opt in, we'll constrain the output of the model to match the JSON language. So you basically will always get something in a curly brace. And this is nice for a lot of people: you can describe your schema, what you want, in the prompt, and we'll constrain it to JSON. But it's not getting you exactly where you want, because you don't want the model to make up the keys or match different values than what you want. If you want an enum or a number and you get a string instead, it's pretty frustrating. So we've been ideating on this for a while, and people have been asking for basically this every time I talk to customers, for maybe the last year. So it's really clear that there's a developer need, and we started working on making it happen. And this is a real collab between engineering and research, I would say. It's not enough to just constrain the model. I think of that as the engineering side, where basically you mask the available tokens that are produced every time to only fit the schema. You can do this engineering thing and force the model to do what you want, but you might not get good outputs. Sometimes with JSON mode, developers have seen that our models output whitespace for a really long time, where they don't...
Because it's a legal character.
Right, it's legal per JSON, but it's not really what they want. And so that's what happens when you take a very engineering-biased approach. But the modeling approach is to also train the model to do more of what you want. And so we did these together: we trained a model which is
significantly better than our past models at following formats, and we did the engineering to serve this constrained decoding concept at scale. So I think marrying these two is why this feature is pretty cool.
You just mentioned "starts and ends with a curly brace," and maybe people's minds go to prefills in the Claude API. How should people think about JSON mode, structured outputs, and prefills? Because some of them are roughly "start with a curly brace and ask for JSON, and it should do it," and then Instructor is like, "hey, here's the rough data schema you should use." How do you think about them?
So I think we designed structured outputs to be the easiest to use. The way you use it in our SDK, I think, is my favorite thing. You just create a Pydantic object or a Zod object, and you pass it in, and you get back an object. So you don't have to deal with any of the serialization...
The parse helper.
Yeah, you don't have to deal with any of the serialization on the way in or out. So I think of this as the feature for the developer who is like, I need this to plug into my system, I need the function call to be exact, I don't want to deal with any parsing. That's where structured outputs is tailored. Whereas if you want the model to be more creative and use it to come up with a JSON schema that you don't even know you want, then that's where JSON mode fits in. But I expect most developers are probably going to want to upgrade to structured outputs.
The thing you just said, you just used interchangeable terms for the same thing, which is tool... function calling and structured outputs. We've had disagreements or discussion before on the podcast about whether they're the same thing. Semantically they're slightly different.
They are, yes.
Because I think the function calling API came out first...
Yes.
...then JSON mode, and we used to abuse function calling for JSON mode.
Right.
Do you think we should treat them as synonymous?
No.
Okay, yeah, please clarify.
Yeah.
And by the way,
there's also tool calling.
Yeah. The history here is we started with function calling, and function calling came from the idea of, let's give the model access to tools and see what it does. We basically had these internal prototypes of what code interpreter is now, and we were like, this is super cool, let's make it an API. But we're not ready to host code interpreter for everybody, so we're just going to expose the raw capability and see what people do with it. But even now, I think there's a really big difference between function calling and structured outputs. You should use function calling when you actually have functions that you want the model to call. If you have a database that you want the model to be able to query from, or if you want the model to send an email or generate arguments for an actual action. And that's the way the model has been fine-tuned: to treat function calling as actually calling these tools and getting their outputs. The new response format is a way of just getting the model to respond to the user, but in a structured way. And so this is very different: responding to a user versus, you know, I'm going to go send an email. A lot of people were hacking function calling to get the response format they needed, and so this is why we shipped this new response format. You can get exactly what you want, and you get more of the model's verbosity. It's responding in the way it would speak to a user, and less just programmatic tool calling, if that makes sense.
Are you building something into the SDK to actually close the loop with the function calling? Because right now it returns the function, then you've got to run it, then you've got to fake another message to then continue the conversation.
They have that in beta, the runs.
Yes, we have this in beta in the Node SDK, so you can basically...
Python?
It's coming to Python as well.
That's why I
didn't know.
See, yeah, I'm a Node guy, so in JavaScript it's already existed.
It's coming everywhere. But basically what you do is you write a function, and then you add a decorator to it, and then there's this runTools method, and it does the whole loop for you, which is pretty cool.
When I saw that in the Node SDK, I wasn't sure if that's... because it basically runs it in the same machine.
Yeah.
And maybe you don't want that to happen.
Right. I think of it as, if you're prototyping and building something really quickly and just playing around, it's so cool to just create a function and give it this decorator. But you have the flexibility to do it however you like.
Like, you don't want it in the critical path of a web request.
I mean, some people definitely will. It's just kind of the easiest way to get started. But let's say you want to execute this function on a job queue, async. Then it wouldn't make sense to use that.
Prior art: Instructor, Outlines, Jsonformer. What did you study? What did you credit or learn from these things?
Yeah, there's a lot of different approaches to this. There's more fill-in-the-blank style sampling, where you basically pre-form the keys and then get the model to sample just the value. There's a lot of approaches here. We didn't use any of them wholesale, but we really loved what we saw from the community and the developer experiences we saw, so that's where we took a lot of inspiration.
There was a question also just about constrained grammar. This is something that I first saw in llama.cpp, which seems to be the most, let's just say, academically permissive.
Yeah.
For those who don't know, maybe you want to explain it, but they use Backus-Naur form, which you only learn in college when you're working on programming languages and compilers. I don't know if you use that under the hood or you explored that.
Yeah, we didn't use any kind of other
stuff. We built our solution from scratch to meet our specific needs. But I think there's a lot of cool stuff out there where you can supply your own grammar. Right now we only allow JSON schema, and a dialect of that, but I think in the future it could be a really cool extension to let you supply a grammar more broadly, and maybe it's more token-efficient than JSON. So lots of opportunity there.
You mentioned before also training the model to be better at function calling. What's that discussion like internally, for resources? It's like, hey, we need to get better JSON mode, and it's like, well, can't you figure it out on the API platform without touching the model? Is there a really tight collaboration between the two teams?
Yeah, so I actually work on the API models team. I guess we didn't quite get into what I do at OpenAI.
Yeah, what would you say it is you do here?
Yeah, so I'm the tech lead for the API, but I also work on the API models team, and this team is really working on making the best models for the API. A common deployment pattern is research makes a model and then you ship it in the API. But I think there's a lot you miss when you do that. You miss a lot of developer feedback and things that are not immediately obvious. What we do is get a lot of feedback from developers and go and make the models better in certain ways. So our team does model training as well. We work very closely with our post-training team. For structured outputs, it was a collab between a bunch of teams, including Safety Systems, to make a really great model that does structured outputs.
Mentioning Safety Systems, you have a refusal field.
Yes.
You want to talk about that?
Yeah, it's pretty interesting. You can imagine, basically, if you constrain the model to follow a schema, there being a schema supplied where it would add some risk or be harmful for the
model to follow that schema. And we wanted to preserve our model's abilities to refuse when something doesn't match our policies or is harmful in some way. So we needed to give the model an ability to refuse even when there is this schema. But also, if you are a developer and you have this schema and you get back something that doesn't match it, you're like, ah, the feature's broken. So we wanted a really clear way for developers to program against this. If you get something back in the content, you know it's valid, it's JSON-parsable. But if you get something back in the refusal field, it makes for a much better UI for you to display this to your user in a different way, and it makes it easier to program against. So really there were a few goals, but it was mainly to allow the model to continue to refuse, but also with a really good developer experience.
Yeah. Why not offer it as an error code? Because we have to display error codes anyway.
Yeah, we waffled for a long time about API design, as we are wont to do. And there are a few reasons against an error code. You could imagine this being a 4xx error code or something, but the developer is paying for the tokens, and that's kind of atypical for a 4xx error code.
We pay with errors anyway, right?
4xx is not... that's a user error, right? And it doesn't make sense as a 5xx either, because it's not our fault. It's the way the model is designed. And I think the HTTP spec is a little bit limiting for AI in a lot of ways. There are things that are in between your fault and my fault. There's kind of the model's fault, and there's no error code for that. So we really have to invent a lot of the paradigm here.
Make a 6xx.
Yeah, that's one option. There's actually some esoteric error codes we've considered adopting.
328, my favorite.
Yeah, there's the teapot one. We're still figuring that out, but I think there are some things, like for example, sometimes our
model will produce tokens that are invalid based on our language, and when that happens, it's an error. But, you know, 500 is fine, which is what we return, but it's not as expressive as it could be. So yeah, just areas where Web 2.0 doesn't quite fit with AI yet.
If you had to put in a spec to just change, what would be your number one proposal, to rehaul the HTTP committee to reinvent the world?
Yeah, I mean, I think we just need a range of model errors, and we can have many different kinds of model errors. Like, a refusal is a model error.
601, model refusal.
Yeah. Again, we've mentioned before that chat completions uses this ChatML format. So when the model doesn't follow ChatML, that's an error, and we're working on reducing those errors. But that's, I don't know, 602, I guess.
A lot of people actually no longer know what ChatML is.
Yeah.
Because that was briefly introduced by OpenAI and then kind of deprecated. Everyone who implements this under the hood knows it, but maybe the API users don't know it.
Basically, the API started with just one endpoint, the completions endpoint, and with the completions endpoint you just put text in and you get text out, and you can prompt in certain ways. Then we released ChatGPT, and we decided to put that in the API as well, and that became the chat completions API. And that API doesn't just take a string input and produce an output. It actually takes in messages and produces messages. So you can get a distinction between, like, an assistant message and a user message, and that allows all kinds of behavior. The format under the hood for that is called ChatML. Sometimes, because the model is so out of distribution based on what you're doing, maybe the temperature is super high, then it can't follow ChatML.
Yeah, I didn't know that there could be errors generated there. Maybe I'm not asking challenging enough questions.
It's pretty rare, and we're working on
driving it down. But actually, this is a side effect of structured outputs now, which is that we have removed a class of errors. We didn't really mention this in the blog, just because we ran out of space, but...
That's what we're here to do.
Yeah. The model used to occasionally pick a recipient that was invalid, and this would cause an error. But now we are able to constrain to ChatML in a more valid way, and this reduces a class of errors as well.
Recipient meaning... so there's a small number of defined roles, like user, assistant, system.
So recipient as in picking the right tool.
Oh.
So the model before was able to hallucinate a tool, but now it can't when you're using structured outputs.
Do you collaborate with other model developers to try and figure out these types of errors, like how do you display them? Because a lot of people try to work with different models. Is there any...
Yeah, not a ton. We're just focused on making the best API for developers.
A lot of research and engineering, I guess, comes together with evals. You published some evals there. I think Gorilla is one of them. What is your assessment of the state of evals for function calling and structured output right now?
Yeah, we've actually collaborated with BFCL a little bit, which is, I think, the same function calling leaderboard. Kudos to the team, those evals are great, and we use them internally. We've also sent some feedback on some things that are misgraded, and so we're collaborating to make those better. In general, I feel evals are the hardest part of AI. When we talk to developers, it's so hard to get started. It's really hard to make a robust pipeline. And you don't want evals that are like 80% successful, because things are going to improve dramatically. It's really hard to craft the right eval. You want to hit everything on the difficulty curve. I find that a lot of these evals are mostly saturated,
like for BFCL, all the models are near the top already, and the errors are more, I would say, just differences in default behaviors. I think most of the models on the leaderboard can get 100% with different prompting, but you're just pulling apart different defaults at this point. So yeah, I would say in general we're missing evals. We work on this a lot internally, but it's hard.
Other than BFCL, would you call out any others, just for people exploring the space?
SWE-bench is actually a very interesting eval, if people don't know. You basically give the model a GitHub issue and a repo and just see how well it does with the issue, which I think is super cool. It's kind of like an integration test, I would say, for models.
It's a little unfair, right?
What do you mean?
A little unfair, because usually as a human you have more opportunity to ask questions about what it's supposed to do, and you're giving the model way too little information to do the job.
It's a hard job. But yeah, SWE-bench targets how well can you follow the diff format, and how well can you search across files, and how well can you write code. So I'm really excited about evals like that, because the pass rate is low, so there's a lot of room to improve. And it's just targeting a really cool capability.
I've seen other evals for function calling, I think it might be BFCL as well, where they evaluate different kinds of function calling. And I think the top one that people care about, for some reason, I don't know personally that this is so important to me, but it's parallel function calling. I think you confirmed that you don't support that yet. Why is that hard? Just more context about it.
So yeah, we put out parallel function calling at Dev Day last year as well, and it's kind of the evolution of function calling. With function calling V1 you just get one function back. With function calling V2 you can get multiple back at the same time
and save latency. We have this in our API. All our models support it, or all of our newer models support it, but we don't support it with structured outputs right now. And there's actually a very interesting trade-off here. When you call our API for structured outputs with a new schema, we have to build this artifact for fast sampling later on. But when you do parallel function calling, the schema we follow is not just directly one of the function schemas, it's this combined schema based on a lot of them. If we were to do the same thing and build an index every time you pass in a list of functions, then if you ever changed the list, you would incur more latency. We thought that would be really unintuitive for developers and hard to reason about, so we decided to wait until we can support a no-added-latency solution, and not make it really confusing for developers.
Mentioning latency, that is something that people discovered: there is an increased cost and latency for the first token.
For the first request, yeah.
First request. Is that an issue? Is that going to go down over time? Or is there just an overhead to parsing JSON that is insurmountable?
It's definitely not insurmountable, and I think it will definitely go down over time. We just take the approach of ship early and often. If there's nothing in there you want to fix, then you probably shipped too late. So I think we will get that latency down over time. But I think for most developers it's not a big concern, because you're testing out your integration, you're sending some requests while you're developing it, and then it's fast in prod. So it works for most people. The alternative design space that we explored was pre-registering your schema, so a totally different endpoint, and then passing in a schema ID. But we thought that was a lot of overhead, and another endpoint to maintain, and
just kind of more complexity for the developer and we think this latency is going to come down over time so it made sense to keep it kind of in chat completions I mean hypothetically if one were to ship caching at a future point it would basically be the superset of that maybe I think the in space is a little underexplored like we've seen kind of two versions of it but I think yeah there's ways that maybe put less onus on the developer but you know we haven't committed to anything yet but we're definitely exploring opportunities for making things cheaper over time is AI in agents just going to be a bunch of structured output and function calling one next to each other like how do you see you know there's like the model does everything where do you draw the line because you don't call these things like an agent API but like if I were a startup trying to raise a c round I would just do function calling and say this is an agent API so how do you think about the difference and like how people build on top of it for like agentic systems yeah love that question one of the reasons we wanted to build structured outputs is to make agentic applications actually work so right now it's really hard like if something is 95% reliable but you're chaining together a bunch of calls if you magnify that error rate it makes your like application not work so that's a really exciting thing here going from like 95% to 100% I'm very biased working on the API and working on function calling and structured outputs but I think those are the building blocks that we'll be using kind of to distribute this technology very far it's the way you connect like natural language and converting user intent into working with your application and so I think like kind of there's no way to build without it honestly like you need your function calls to work like yeah we wanted to make that a lot easier yeah and do you think the Assistants kind of like API thing will be a bigger part as people
build agents I think maybe most people just use messages and completion and so I would say the Assistants API was kind of a bet in a few areas one bet is hosted tools so we have the file search tool and code interpreter another bet was kind of statefulness it's our first stateful API it'll store you know threads and you can fetch them later I would say the hosted tools aspect has been really successful like people love our file search tool and it like saves a lot of time to not build your own RAG pipeline um I think we're still iterating on the shape for the stateful thing to make it as useful as possible right now there's kind of a few endpoints you need to call before you can get a run going and we want to work to make that you know much more intuitive and easier over time one thing I'm just kind of curious about did you notice any tradeoffs when you add more structured output it gets worse at some other thing that was like kind of you didn't think was related at all yeah it's a good question yeah I mean models are very spiky and RL is hard to predict and so every model kind of improves on some things and maybe is flat or neutral on other things yeah like it's very rare to just add a capability and have no trade-offs on everything else so yeah I don't have something off the top of my head but I would say yeah every model is a special kind of its own thing this is why we put them in the API dated so developers can choose for themselves which one works best for them in general we strive to continue improving on all evals but it's stochastic yeah were you able to apply the structured output system on backdated models like uh 4o May as well as mini as well as August actually the new response format yeah is only available on two models it's 4o mini and the new 4o okay so the old 4o doesn't have the new response format okay however for function calling we were able to enable it for all models that support function calling and that's because those models were already
trained to follow these schemas we basically just didn't want to add the new response format to models that would do poorly at it because they would just kind of do infinite white space which is you know the most likely token if you have no idea what's going on I just wanted to call out a little bit more in the stuff you've put in the blog post so in the blog post just use cases right I just want people to be like yeah we're spelling it out for you use these for extracting structured data from unstructured data by the way it does vision too right so that's cool dynamic UI generation actually let's talk about dynamic UI um I think gen UI I think is something that people are very interested in yeah is your first example what did you find about it yeah I just thought it was a super cool capability we have now so the schemas we support recursive schemas and this allows you to do really cool stuff like you know every UI is a nested tree that has children and so I thought that was super cool you can use one schema and generate like tons of UIs as a back-end engineer who's always struggled with JavaScript and front end like for me that's super cool we've now built a system where I can get any front end that I want so that's super cool the extracting structured data like the reality of a lot of AI applications is like you're plugging them into your enterprise business and you have something that works but you want to make it a little bit better and so the reliability gains you get here is like you'll never get like a classification using the wrong enum it's like it's just exactly your types so really excited about that like maybe hallucinate the actual values right so let's clearly state what the guarantees are the guarantee is that it fits the schema but the schema itself may be too broad because the JSON Schema type system doesn't say like I only want to range from 1 to 11 you might give me zero give me 12 so yeah JSON Schema so this is actually a
good thing to talk about so JSON Schema is extremely vast and we weren't able to support every corner of it so we kind of support our own dialect and it's described in the docs and there are a few trade-offs we had to make there so by default if you don't pass in additionalProperties in a schema by default that's true and so that means you can get other keys which you know you didn't spell out which is kind of the opposite of what developers want you basically want to supply the keys and values and you want to get those keys and values and so then we had a decision to make it's like do we redefine what additionalProperties means as the default and that felt really bad it's like there's a schema that's predated us like you know it wouldn't be good it would be better to play nice with the community and so we require that you pass it in as false you know one of our design principles is to be very explicit and so developers know you know what to expect and so this is one where we decided you know it's a little harder to discover but we think you should pass this thing in so that we can have like a very clear definition of what you mean and what we mean there's a similar one here with like required by default every key in JSON Schema is optional but that's not what developers want right like you'd be very surprised if you passed in a bunch of keys and you didn't get some of them back and so that's the trade-off we made is to make everything required and have the developers spell that out is there a required false can people turn it off or they're just getting all so developers can basically what we recommend for that is to make your actual key a union type and so yeah make it a union of int and null and that gets you the same behavior any other of the examples you want to dive into math chain of thought yeah you can now specify like a chain of thought field before a final answer this is just like a more structured way of extracting the final answer yeah one
example we have I think we put up a demo app of this math tutoring example or it's coming out soon I missed it oh okay well basically it's this math tutoring thing and you put in an equation and you can go step by step and insert this is something you can do now with structured outputs in the past a developer would have to like specify their format and then write a parser and parse out the model's output which would be pretty hard but now you just specify like steps and it's an array of steps and every step you can render and then the user can try it and you can see if it matches and go on that way so I think it just opens up a lot of opportunities like for any kind of UI where you want to treat different parts of the model's responses differently structured outputs is great for that I remembered my question from earlier I'm basically just using this to ask you all the questions as a user as a daily user of the stuff that you put out so one is a tip that people don't know and I confirmed it with you on Twitter which is you respect descriptions of JSON schemas right and you can basically use that as a prompt for the field totally I assume that's blessed and you know people should do that right one thing that I started to do which I don't know it could be a hallucination of mine is I changed the property name to prompt the model to what I wanted to do so for example instead of saying topics as a property name I would say like brainstorm a list of topics up to five or something like that as like a property name I could stick that in the description as well but is that too much yeah I would say I mean we're so early in AI that people are figuring out the best way to do things and I love when I learn from a developer like a way they found to make something work in general I think there's like three or four places to put instructions yeah you can put instructions in the system message and I would say that's helpful for like when to call a function so it's like you
know let's say you're building a customer support thing and you want the model to verify the user's phone number or something you can tell the model in the system message like here's when you should call this function then when you're within a function I would say the descriptions there should be more about how to call a function so really common is someone will have like date as a string but you don't tell the model like do you want year year month month day day or do you want that backwards and that's what a really good spot is for those kind of descriptions is like how do you call this thing and then sometimes there's like really stuff like what you're doing it's like name the key by what you want so sometimes people put like do not use and you know if they don't want you know this parameter to be used except only in some circumstances and really I think that's the fun nature of this it's like you're figuring out the best way to get something out of the model okay so you don't have an official recommendation is what I'm hearing well the official recommendation is you know how to call a model system instructions exactly exactly that function yeah do you benchmark these type of things so like same with date it's like description it's like return it in like ISO 8601 or if you call the key date in ISO 8601 I feel like the benchmarks don't go that deep but then all the AI engineering kind of community like all the work that people do is like oh actually this performs better but then there's no way to verify right you know like uh even the I'm going to tip you $100,000 or whatever like some people say it works some people say it doesn't do you pay attention to the stuff as you build this or are you just like the model is just going to get better so why waste my time running evals on these small things yeah I would say to that I would say we basically pick our battles I mean there's so much surface area of LLMs that we could dig into and we're just
mostly focused on kind of raising the capabilities for everyone I think for customers and we work with a lot of customers really developing their own evals is super high leverage cuz then you can upgrade really quickly when we have a new model you can experiment with these things with confidence so yeah we're hoping to make evals easier I think that's really generally very helpful for developers for people I would just kind of wrap up the discussion for structured outputs I immediately implemented we use structured outputs for AI news I use instructor and I ripped it out and I think I saved um 20 lines of code but more importantly it was like we cut it by 55% of API cost based on what I measured because we saved on the retries nice love to hear that yeah which people I think don't understand when you you can't just simply like add instructor or add outlines you can do that but it's actually going to cost you a lot of retries to get the model that you want but you're kind of just kind of building that internally into the model yeah I think this is the kind of feature that works really well when it's integrated with like the LLM provider yeah actually I had folks even my husband's company who works at a small startup they thought we were just retrying um so I had to make clear we are not retrying you know we're doing it in one shot and this is how you save on latency and cost awesome any other behind the scenes stuff just generally on structured outputs we're going to move on to the other models yeah I think that's it oh look that's an excellent product and I think everyone will be using it and we have the full story now that people can try out so roadmap would be parallel function calling anything else that you've called out as like coming soon uh quite soon but you know we're thinking about does it make sense to expose custom grammars um beyond JSON Schema what would you want to hear from developers to give you
information whether it's custom grammars or anything else about structured output like what do you want to know more of just you know always interested in feature requests what's not working but I'd be really curious like what specific grammars folks want I know some folks want to match programming languages like Python there's some challenges like with the expressivity of our you know implementation and so yeah just kind of the class of grammars folks want I have a very simple one which is a lot of people try to use GPT as judge right which means they end up doing a rating system and then there's like 10 different kinds of rating systems there's a Likert scale or whatever if there was an officially blessed way to do a rating system with structured outputs totally everyone would use it yeah yeah that makes sense I mean we often recommend using logprobs with classification tasks so rather than like sampling you know let's say you have four options like red yellow blue green rather than sampling you know two tokens for yellow you can just do like ABCD and get the logprobs of those you know the inherent randomness of each sampling isn't taken into account and you can just actually look at what is the most likely token I think this is more of like a calibration question like if I ask you to rate things from 1 to 10 a non-calibrated model might always pick seven just like a human would right so like actually have a nice gradation from 1 to 10 would be the rough idea yeah and then even for structured outputs I can't just say have a field of rating from 1 to 10 because I have to validate it and you know it might give me 11 yeah absolutely yeah so what about model selection now you have a lot of models when you first started you had one model endpoint I guess you had like the and then but like most people were using one model endpoint today you have like a lot of competitive models and I think we're nearing the end of the 3.5 run RIP how do you advise
people to like experiment select both in terms of like task and like cost like what's your playbook in general I think folks should start with 4o mini that's our cheapest model and it's a great workhorse uh works for a lot of great use cases if you're not finding the performance you need like you know maybe it's not smart enough then I would suggest going to 4o and if 4o works well for you that's great finally there's some like really advanced frontier use cases um and maybe 4o is not quite cutting it and there I would recommend our fine-tuning API even just like 100 examples is enough to get started there and you can really get the performance you're looking for we're recording this ahead of it but like you're announcing some other fine-tuning stuff that people should pay attention to yeah actually tomorrow we're dropping our GA for GPT-4o fine-tuning so 4o mini has been available for a few weeks now and 4o is now going to be generally available and we also have a free training offering for a bit I think until September 23rd you get 1 million free training tokens a day this is already announced right oh was am I talking about a different so that was for 4o mini and now it's also for 4o so we're really excited to see what people do with it and it's actually a lot easier to get started than a lot of people expect they think they might need tens of thousands of examples but even 100 really high quality ones or a thousand is enough to get going oh well we might get a separate podcast just specifically on that but um you know we haven't confirmed that yet it basically seems like every time I think people's concerns about fine-tuning is that they're kind of locked into a model and I think you're paving the path for migration of models as long as they keep their original data set like they can at least migrate nicely yeah I'm not sure what we've said publicly there yet but we definitely want to make it easier for folks to migrate it's the number one concern you
know I'm just you know it's obvious absolutely I also want to point people to you have official model sele
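Editor's sketch (not part of the conversation): the schema rules discussed in the transcript are that every object passes additionalProperties as false, every key is listed as required, and an "optional" key is written as a union with null. A minimal Python illustration follows, assuming a made-up schema for the math tutoring example. The field names `explanation`, `output`, and `confidence`, and the helper `strictness_problems`, are hypothetical and are not taken from the OpenAI docs or SDK. The helper only checks those two rules locally and makes no API call.

```python
# Hypothetical schema for the math tutoring example: an array of steps
# followed by a final answer. Field names are illustrative assumptions.
math_response_schema = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "explanation": {"type": "string"},
                    "output": {"type": "string"},
                },
                "required": ["explanation", "output"],
                "additionalProperties": False,
            },
        },
        "final_answer": {"type": "string"},
        # Every key is required, so an "optional" value is a union with null.
        "confidence": {"type": ["integer", "null"]},
    },
    "required": ["steps", "final_answer", "confidence"],
    "additionalProperties": False,
}


def strictness_problems(schema, path="$"):
    """List places where an object schema breaks the two rules above.

    Only handles plain "type": "object" schemas and array "items";
    it is a local sanity check, not a full JSON Schema validator.
    """
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        for name, sub in props.items():
            problems.extend(strictness_problems(sub, f"{path}.{name}"))
    if isinstance(schema.get("items"), dict):
        problems.extend(strictness_problems(schema["items"], f"{path}[]"))
    return problems


# The schema above follows both rules; a schema that leans on the
# JSON Schema defaults is flagged on both counts.
loose_schema = {"type": "object", "properties": {"topics": {"type": "string"}}}
print(strictness_problems(math_response_schema))
print(strictness_problems(loose_schema))
```

Running a check like this before sending a schema is one way to catch the "harder to discover" requirement mentioned in the conversation. The actual set of supported keywords is the dialect described in the provider's docs.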
Original Description
Chapters
[00:00:00] Introductions
[00:06:37] Joining OpenAI pre-ChatGPT
[00:08:21] ChatGPT release and scaling challenges
[00:09:58] Structured Outputs and JSON mode
[00:11:52] Structured Outputs vs JSON mode vs Prefills
[00:17:08] OpenAI API / research teams structure
[00:18:12] Refusal field and why the HTTP spec is limiting
[00:21:23] ChatML & Function Calling
[00:27:42] Building agents with structured outputs
[00:30:52] Use cases for structured outputs
[00:38:36] Roadmap for structured outputs
[00:42:06] Fine-tuning and model selection strategies
[00:48:13] OpenAI's mission and the role of the API
[00:49:32] War stories from the trenches
[00:51:29] Assistants API updates
[00:55:48] Relationship with the developer ecosystem
[00:58:08] Batch API and its use cases
[01:00:12] Vision API
[01:02:07] Whisper API
[01:04:30] Advanced voice mode and how that changes DX
[01:05:27] Enterprise features and offerings
[01:06:09] Personal insights on Waterloo and reading recommendations
[01:10:53] Hiring and qualities that succeed at OpenAI