Is finetuning GPT4o worth it?

Latent Space · Beginner ·🧠 Large Language Models ·1y ago

Skills: Fine-tuning LLMs90%LLM Foundations80%LLM Engineering70%

Key Takeaways

The video discusses the effectiveness of fine-tuning GPT-4 and its applications in software engineering, with a focus on Latent Space's Genie platform, which utilizes retrieval-augmented generation and fine-tuning to improve code writing and retrieval accuracy. Tools and techniques such as GPT-4, Codex, and perx API are demonstrated, along with the importance of data cleaning, alignment, and sharing for model development.

Full Transcript

hey everyone welcome to the laden space podcast this is alesio partner and CTO and Residence at deible partners and I'm joined by my co-host swix founder of small AI hey and today we're back in the studio in person after about 3 to four months in Visa jail and travels and all other fun stuff that we talked about in the previous episode uh but today with have special guest Ali pen from cosign welcome hi thanks for having me very lucky to have you because you're on a two-day trip toan franisco I would not recommend it don't fly from London to San Francisco for two days and you launched genie on a plane on plane Wi-Fi um claiming state-ofthe-art in sweet Ben which we're all going to talk about I'm excited to dive in into into your whole journey because it has been a journey I've been lucky to be a small angel in in part of that journey and it's exciting to see that you're launching to such a such a a claim and you know such results um so I'll go over your brief background and then you can s of fill in the blank say on on you know what else people should know about you you did your bachelors in computer science and exitor and then you worked at a startup that got acquired into gopuff and around about 2022 you started working on a stealth startup that became a YC startup what what's that story Yeah so basically when I left University I I met my now co-founder Sam at the time we were both mobile devs he was an Android developer I was an iOS Developer and Wall Street University we built this sort of small consultancy sort of we'd um be approached to build projects for people and we would just take them up and start with they were student projects they weren't they weren't anything crazy or anything big we started with those and over time we started doing larger and larger projects more interesting things and actually when we left University we just kept doing that we didn't really get jobs traditional jobs it was also like in the middle of Co middle of lockdown so we were like this is a pretty good gig we'll just keep like writing cod in our bedrooms and we did that for a while and then a friend of ours that we went to ex to with started a YC startup during covid and it was one of these fast grocery delivery companies at the time I was living in the deepest darkest Countryside in England where fast grocery companies are still not a thing so he he sort of pitched me this idea and was like listen like I need an iOS Dev do you fancy coming along and I thought absolutely it was a chance to get out my parents house chance to move to London you know do interesting things and at the time truthfully I had no idea what YC was I I had no idea I wasn't in the startup space I knew I liked coding and building apps and stuff but I'd never never really done anything in that area so I said yes absolutely I moved to London just sort of as covid was ending and yeah worked at what was fancy for about a year and a half then we brought Sam along as well so we Sam and I were the two engineers at fancy for basically its entire life and we built literally everything so like the the front the client mobile apps the the back ends the internal like stock management system the driver routing algorithms all those things literally like everything it was my first you know both of us were super in experienced we didn't have like proper engineering experience there were definitely decisions we'd do differently now we' definitely buy a lot of stuff off the shelf stuff like that but it was the initial dip of the toe into like the world of startups and we were both like hooked immediately like this is so cool this sounds so much better than all our friends who were like consultants and doing like normal jobs right we did that and it ran course and after I want to say 18 months or so gopuff came and acquired us and there was obviously a transitionary period and integration period like with all the Acquisitions and we did that and as soon as we vested what we wanted to vest and as soon as we thought okay this chapter is sort of done uh in about 20122 we left and we knew that we wanted to go alone and try something like we'd had this taste now we knew we we'd seen how like a YC startup was managed like up close and we knew that we wanted to do something similar ourselves we had no idea what it was at the time we just knew we wanted to do something so we we tried a small um some small projects in various different areas but then Sam talked to me about gpt3 he'd seen it on Reddit and I the source of all knowledge loves Reddit I'd actually heard of gpt2 and obviously had like Loosely followed What open AI had done with what was the game they trained a model to play was it DOTA yeah so I i' followed that and I knew Loosely what gbt2 was I knew what Bert was was so I was like okay this gpt3 thing sounds interesting and he just mentioned it to me on a walk and I then went home and like Googled gpt3 and there was the playground it was the and the model was Da Vinci 2 at the time and it was just the the old school playground completions nothing crazy no chat no nothing that Miss completion still yeah oh completion honestly I had this conversation in open hour yesterday I was like I just I know but yeah so we we um I started playing around with the playground and the first thing I wrote into it was like hello world and it gave me some sort of like fairly generic response back and I was like okay that looks pretty cool the next thing was I looked through the docs um Al They had a lot of example prts cuz I had no idea I didn't know if the if you could put anything in I didn't know if you had to structure it in a certain way or whatever and I and I saw that it could start writing like tables and Jason and stuff like that so I was like okay can you write me something in Jason and it did and I was like oh wow this is this is pretty cool um Can it can you can just write arbitary Jason for me and um immediately as soon as I realized that my mind was racing and I like got Sam in and we just started messing around in the playground like fairly innocently to start with and then of course both being mobile devs and also seeing at that point we learned about what the Codex model was as like this thing's trained to write code sounds awesome and co-pilot was start I think I I can't actually remember if co-pilot had come out yet or that it might have done it's round about the same time about the same time yeah and we were like okay as mobile Dev let's see what we can do so the initial thing was like okay let's see if we can get this AI to build us a mobile app from scratch we eventually built the world's most flimsy system which was back in the day like 4,000 token context Windows like chaining prompts trying to keep as much context from one to the other all these different things where essentially you'd put in an app idea in a box and then we'd do like very high level stuff figuring out what the stack should be figuring out um what the front end should be written in back end should be written in all these different things and then we'd go through like for each thing more and more levels of detail until the point that you actually got codex to write the code for each thing and we didn't do any templating or anything we were like no we're going to write all the code from scratch every time which is basically why it barely worked but there were like occasions where you could put in something and it would build something that did actually run the back end would run the database would work and we were oh my God this is insane this is so cool and that's what we showed to our co-founder Yang I met my co-founder Yang through through fancy cuz his wife was their first employee and um we showed him he was like you've discovered fire what is this like this is insane he has a lot more startup experience historically he's had a few exits in the past and has been through all different Industries he's like our dad he's a bit older he hates me saying that but he's he's your Co now he's our Co yeah and we showed him and he was like this is absolutely amazing let's just do something because he he at the time um was just about to have a child so he didn't have anything going on either so we we applied to YC got an interview the interiew interview was as most YC interviews are shortcut and pretty brutal they told us they hated the idea and they didn't think it would work and that's when we started brainstorming it was almost like the interview was like an office house kind of thing and we were like okay given what you know about the space now and how to build things with with these llms like what can you bring out of what you've learned in building that thing into something that might be a bit more useful to people on The Daily and also y obviously likes B2B startups a little bit more at least at the time they did back then so we were like okay maybe we could build something that helps you with existing code bases like can sort of automate development stuff with existing code bases not knowing at all what that would look like or how you would build it or any of these things and they were like yeah that sounds interesting you should probably go ahead and do that you're in you've got two weeks to build this an MVP and we were like okay okay we did our best the MVP was absolutely horrendous it was a CLI tool it sucked and um at the time we were like we we don't even know how to build what we want to build and we didn't really know what we wanted to build to be honest like we knew we wanted to try to help automate Dev work but back then we just didn't know enough about how llm apps were built the intricacies and all those things and also like the llm themselves like 4,000 tokens you're not going very far they're extremely expensive so we ended up building a uh a code based retrieval tool originally our thought process originally was we want to build something that can do our jobs for us that is like the gold star we know that we've seen like there are glimpses of it happening with our initial demo that we did but we don't see the path of how to do that at the moment like the tech just wasn't there so we were like well there are going to be some things that you need to build this when the tech does catch up so retrieval being one of the most important things like the model's going to have to be able to like pull code out of a code base somehow so we were like well let's just build the tooling around it and eventually when the tech comes then we'll be able to just like plug it into our our tooling and then it should work basically and to be fair that's basically what we've done and that's basically what's happened which is very fortunate but in the meantime whilst you're waiting for everything to sort become available we built this codebase retrieval tool that was the first thing we ever launched when we were in YC like that and it didn't work it was really frustrating for us cuz it was just me and Sam like working like all hours trying to get this thing to work it was quite a big task in of itself trying to get like a good semantic search engine working that could run locally on your machine we were trying to avoid sending code to the cloud as much as possible and then for very large code bases you're like you know millions of lines of code you're trying to do some sort of like local hnsw thing that runs inside your vs code instance that eats all your RAM as you've seen in the past all those different things yep yeah my first call with you like had TR it sucks man I yeah no I know I know it sucks I'm sorry um but building all that stuff was essentially the first 6 to8 months of of what at the time was built which by the way build build yeah terrible terrible that was the worst um part of trying to think about whether I would invest is whether or not people could pronounce pronounce the name no when we so when we went on our first ever YC like retreat no one got the name right they were like build build build well um and then we actually changed the names cosign like although some people spell it co as in like as if you're cosigning for an apartment or something like that's like can't win yeah that was what built was back then but the ambition and and I did a talk on this back in the end of 2022 the ambition to like build a something that essentially automated our jobs was still very much like core to what we were doing but for a very long time it was just never apparent to us like how would you go about doing these things even when like you had 3.5 16k 16k suddenly felt huge because you've gone from 4 to 16 but even then 16k is like a lot of python files are longer than 16k so you can't you know before you even start doing completion even then we were like eh yeah it looks like we're still waiting and then like towards the end of last year you then start you see 32k 32k was really smart it was really expensive but also like you could fit a decent amount of stuff in it 32k felt enormous and then finally 128k came along we were like right this is like this is what we can actually deal with because fundamentally to build a product like this you need to get as much information in front of the model as possible and make sure that everything ever writes in output can be traced back to Something in the context window so it's not hallucinating it as soon as that model existed I was like okay I know that that's this is now going to be feasible in some way we' done early sort of Dev work on Genie using 3.56k and that was a very very like crude way of proving that this Loop that we were after and The Way We Were generating the data actually had signal and worked and and could do something but the model itself was not useful because you couldn't ever fit enough information into it for it to be able to do the task competently and also the base intelligence of the model I mean 3.5 anyone who's used 3.5 knows the base intelligence the model is is lacking especially when you're asking it to like do software engineering is quite quite involved so we saw the 128k context model and um at that point we'd been in touch with open AI about our Ambitions and like how we wanted to build it we essentially I just took a punt I was like I'm just going to ask to see can we like train this thing because at the time for Turbo had just come out and back then there was still a decent amount of lag time between like open ey releasing a model and then allowing you to fine-tune it in some way they've gotten much better about that recently like for fine tuning came out either I think a day for a mini came out like a day after the model did and I know that's something they're definitely like optimizing for super heavily inside which is great to see which is a little bit you know for a year or so YC companies had like a direct slack channel to open AI we still do yeah yeah I so it's a little bit of the diminishing of the YC Advantage there if they're releasing this fineing ability like a day after yeah no no absolutely but like you can't bu able a startup on the YC Advantage it's obviously nice it makes you feel warm and fuzzy inside but like at the end of the day it's not that that's going to make you win yeah but yeah no so like we we' spoken to shaml there their Devo I'm sure you know him um Sol hit of solutions or something he is in their applied team yeah we'd been talking to him from the very beginning when we' got into YC and he's been absolutely fantastic throughout I basically had pitched him this idea back when we were doing on 3.5 16k and I was like this is my this is my crazy thesis I want to see if this can work and as soon as like that 128k model came out I was I started like laying the ground workor I was like I know this definitely isn't possible cuz you released it like yesterday but know that I want it and in the interim like GPT 4 like 8K fine tuning came out we tried that it's obviously even fewer tokens but the intelligence helped and I was like if we can marry the intelligence and the context window length then we're going to have something special and eventually we were able to get on the experimental access program and we got access to for Turbo fine-tuning as soon as we did that because in the entire run up to that we built the data pipeline we already had all that set up so we're like right we have the dat now we have the model let's put it through and and iterate essentially and that's that's where like Genie is we know it today really was born I won't pretend like the first version of Gene that we trained was good it was a disaster that's where you realize all the implicit biases in your data set and you realize oh actually this decision you made that was fairly arbitrary was the wrong one you have to do it a different way other subtle things like you know how you write get diffs in you using llms and how you can best optimize that to make sure they actually apply and work and loads of different to the ledge cases but as soon as we had access to the underlying tool we were like right we can actually do this and I was I breathe the sign relief CU I did I didn't know it was like it wasn't a done deal but I knew that we could build something useful and I knew that we could build something that um would be measurably good on whatever eval at the time that you wanted to use like at the time back then we weren't actually that familiar it was but once Devin came out and they announced their swe Ben SC I like that's when my life took a turn challenge accepted yeah challenge accepted and that's where like yes that's where my my friendships have gone my sleep has gone like my weight everything g into sweet bench and yeah we we it was actually a very useful tool in building Genie because beforehand it was like let Vibe check this thing and see if it's useful and then all of a sudden you have an actual measure to to see like could it do software engineering not not the best measure obviously but like it's a it's the best that we've got now we were just iterated and and built and eventually we got it to the point where it is now now and a little bit beyond since we actually like did we actually got that score a couple of weeks ago and yeah it's been a hell of a journey from the beginning all the way now that was a very rambling answer to your question about how we got here but that's essentially a potted yeah answer how we got here got the full origin story out yeah no totally you mentioned bios and the data and some of these things in your announcement video you called Genie the worst first AI software engineering colleague and you kind of highlighted how the data needed to train it needs to show how a human engineer works I think maybe you're contrasting that to just putting code in it there's kind of like a lot more than code that goes into software engineering correct how do you think about the data mixture you know and like uh there's this kind of known truth that code makes models better when you put in the pre-training data but since we put so much in the pre-training data what else do you add when you try to geni in yeah I think well there's that I think that sort of boils down fundamentally to the difference between a model writing code and a model doing software engineering because is that the the software engineering sort of discipline goes wider because if you look at something like a PR that is obviously a artifact of some thought and some work that has happened and has eventually been squashed into you know some diffs right what the very crudely what the pre-trained models are reading is they're reading those final diffs and they're emulating that and they're being able to Output it right but of course it's a super lossy thing a PR you have no idea why or how for the most part unless there are some comments which you know anyone who's worked in a company realizes PR reviews can be a bit dodgy at times but you see that you lose so much information at the end and and that's perfectly fine because PRS aren't designed to be something that perfectly preserves everything that happened but what we realized was if you want something that's a software engineer and very crudely we started with like something that can do PRS for you essentially you need to be able to figure out why those things happened otherwise you're just going to rely you essentially you just have a code writing model you have something that's good at human eval but but not very good at s bench essentially that realization was was part of the the kernel of the idea of of the approach that we took to design the agent that that is Genie the way that we decided we want to try to extract what happened in the past like as forensically as possible has been and is currently like one of the the main things that we focus all our time on because doing that as getting as much signal out as possible doing that as well as possible is the biggest thing that that we've seen that determines how well we do on that Benchmark at the end of the day once you've sorted things out like how like output structure how how to get it consistently writing diffs and all the stuff that is sort of ancillary to the model actually figuring out how to solve a problem the core bit of solving the problem is how did the human solve this problem and how can we best come up with how the human solve these problems so all the effort went in on that on that Pipeline and the mix that we ended up with was as you've probably seen in the technical report and so on all of those different languages and different combinations of different task types all of that has run through that Pipeline and we've extracted all that information out how does said theer when you work with customers that have private workflows like do you think is there usually a big Delta between what you get in open source and maybe public data versus like yeah when you scrape enough of it most of Open Source is updating readmes and docs it's hilarious like we had to filter out so much of that stuff because when we first did the 16 3.5 16k model like the amount of read me updating that went in we did like no data cleaning no real like we just sort of threw it in and saw what happened and it was just like it was really good at updating readmes really good at like writing some comments really good at um complaining in git reviews in in PR reviews rather and it would again like we didn't clean the data so you'd like give it some feedback and it would just like reply like it would just be quite insubordinate when it was getting back to like no I don't think you're right and they'll just sort of argue with you so the process of of doing all that was super interesting because we realized from the beginning okay there's a huge amount of work that needs to go into like cleaning this getting it aligned with what we want the model to do to be able to get the model to be useful in some way I'm curious like how do you think about the customer willingness to share all of this historical data I've done a lot of developer tools investing in my career and uh getting access to the code base is always one of the hard things are people getting more cautious about sharing this information in the past it was maybe like you know you're using static analysis tool or like whatever else you need to plug into the code base fine now you're building a model based on it like what's the discussion going into these companies are most people comfortable with like letting you see how they work and sharing everything or it depends on the sector mostly we've actually seen I'd say people becoming more amable to the idea over time actually rather more skeptical because I think they can see the the upside if this thing does what they say it does it's going to be more help to us than it is a risk to our infc um um and of course like companies building in this space we're all going to end up you know complying with the same rules and there are going to be new rules that come out to make sure that we're looking at your code that everything is safe and so on so from what we've seen so far we've spoken to some very large companies that you've definitely heard of and all of them obviously have stipulations and many of them want it to be sandbox to start with and all the like very obvious things that I you know I would say as well but they're all super Keen to have a go and see because like despite all those things if we can genuinely make them go faster allow them to build more in a given time period and stuff it's it's super worth it to them okay I'm going to dive in a little bit on the process that you have created you showed the demo on on your video and by the time that we release this you should be taking people off the weit list and launching people so people can see this themselves there's four main parts of the workflow which is finding files planning action writing code and running tests and controversially you have set yourself apart from the devans of the World by saying that things like having access to a browser is not that important for you is that an accurate reading of what you wrote I don't remember saying that but at least with what we've seen the browser is helpful but it's not as helpful as like ragging the correct files if that if that makes sense like it is still helpful but obviously there are there are more fundamental things you have to get right before you get to like oh yeah you can read some docs or you can read a stack Overflow article and stuff like that yeah the phrase I was indexing on was the other software tools are wrappers around foundational models with a few additional tools such as web browser or code interpr oh I see no I mean I'm not I'm not I'm not I'm deriding the the the approach that not the not the tools yeah exactly so like I would say in my standard model of what a code agent should look like uh Devon has been very influential obviously because you could just at the Docks of something and you know now I have now when I'm installing a new library I can just add docs and cursor also does this right and then obviously having a code interpreter does help I guess you have that in the form of running tests I mean Genie has both of those tools available to it as well so yeah yeah so we have a tool where you can like put in URLs and it will just read the URLs and you can it also uses perx API under the hood as well to be able to actually ask questions if it wants to okay so now we use both of those tools as well like those tools are super important and super key I I think obviously the most important tools to these agents are like being able to retrieve code from a code base being able to read stack over for articles and and what have you and just be able to essentially be able to Google like we do is definitely super useful yeah I thought maybe we could just kind of dive into each of those actions Code retrieval one of the core problems you had an indexer that you worked on uh even as as built what makes it hard what approach you thought would work didn't work anything like that it's funny I had a similar conversation to this when I was chatting to the guys from open yesterday the thing is that searching for code specifically semantically at least to start with I mean like keyword search and stuff like that is is a Sol problem it's been around for ages but at least being able to the phrase we always used back in the day was searching for what code does rather than what code is like searching for functionality is really hard really hard the way that we approached that problem was that obviously like a very basic and easy approach is right let's just embed the codebase we'll chunk it up in some arbitary way maybe using an as maybe using number of lines maybe using whatever like some overlappings just chunk it up and embed it and once you've done that I will write a query saying like find me some authentication code or something embed it and then do the coine similarity and get the top K right that doesn't work and I wish it did work don't get me wrong it doesn't work well at all because fundamentally if you think about like semantically how code looks is very different to how English looks and there's like not a huge amount of of signal that's carried between the two so what we ended the first approach we we took and and the kind of did well enough for a long time was okay let's train a model to be able to take in English code queries and then produce a hypothetical code snippet that might look like the answer embed that and then do the C on similarity and that process although very simple gets you so much more performance out of the retrieval accuracy and that was kind of like the start of our of our engine as we called it which is essentially like the aggregation of all these different her istics like semantic keyword LSP and so on to and and then we essentially had like a a model that would given an input choose which ones it thought were most appropriate given the type of requests you had so the whole code search thing was a really hard problem and actually what we ended up doing with Genie is we um let the model through selfplay figure out how to retrieve code so actually we don't use our engine for Genie so instead of like a request coming in and then like say gbt 4 with some Json output being like well I think here we should use a keyword with these inputs and then we should use some antic and then we should like pick these results it's actually like a question comes in and Genie has self- played in its training data to be able to be like okay this is how I'm going to approach finding this information much more akin to how a devel a developer would do it because if I was like Sean go into this new Cod base you've never seen before and find me the code that does this you're going to probably you might do some words you're going to look over the file system you're going to try to figure out from the directories and the file names where it might be you're going to like jump in one and then once you're in there you're probably going to be doing the you know go to definition stuff to like jump from file to file and try to use the graph to like get closer and closer and that is exactly what Genie does starts on the file system looks at the file system picks some candidate files is this what I'm looking for yes or no if there's something that's interesting like an import or something it can it can command click on that thing go to definition go to references and so on and it can Traverse the the code based that way are you using the vs code uh LSP or no that's no we're not long we're not doing this in vs code we're just using the language servers running but we really wanted to try to mimic the way we do it as best as possible and we did that during the selfplay process when we were generating the data set so although we did all that work originally and and although like Genie still has access to these tools so it can do keyword searches and it can do you know basic semantic searches and it can use the graph it uses them through this process and and figures out okay I've learned from data how to find stuff in code bases and I think in our technical report I can't remember the exact number but I think it was around 65 or 66% retrieval accuracy overall measured on we know what lines we need for these tasks to find for the task to actually be able to be completed and we found about 66% of all those lines which is one of the biggest areas of free performance that we can get a hold of because when we were building Genie truthfully like a lot more Focus went on assuming you found the information you've been able to reproduce the issue assuming that's true how do you then go about solving it and the bulk of the work we did was on the solving but when you go higher up the funnel obviously like the funnel looks like have you found everything you need for the task are you able to reproduce the problem that's seen in the issue are you then able to solve it and the funnel gets narrower as you go down and at the top of the funnel of course is raged so I'm actually quite happy with that score I think it's still pretty impressive considering the size of some of the code bases we're doing we're using for this but as soon as that if that number becomes 80 think how many more tasks we get right that's one of the key areas we're going to focus on when we continue working on Genie be interesting to break out a benchmark just for that um just to try because I don't know what stateof the art is yeah I mean like for a um it's super easy because like for a given PR you know what lines were edited oh okay yeah you know you can just you can s it from sub bench actually yeah you can do it you can do it with super easily and that's how we got that figure out at the other end um for us being able to see it against um our historic models were super useful so we could see if we were you know actually helping ourselves or not and initially one of the biggest performance gains that we saw when we were work when we did work on the ragab bit was giving it the ability to use the LSP to like go to definition and really try to get it to emulate how we do that because I'm sure when you go into an editor without where where like the LSP is not working or whatever you suddenly feel really like disarmed and naked you're like oh my God I didn't realize how much I actually use this to get about rather than just fine stuff so we really tried to get it to do that and that gave us a big jump in performance so we went from like 54% up like the 60s but just by adding focusing on that it's one weird trick yes um I I'll briefly comment here so this is the standard approach I would say most uh code tooling startups are pursuing the one company that's not doing this is Magic dodev yes so would you do things differently if you have a 10 million token context window if I had a 10 million context window and hundreds of millions of dollars I wouldn't have gone and built uh it's an LTM it's not Transformer right that they're using right if I'm not mistaken I believe it's not a Transformer yeah Eric's going to come on at some point I'm listen they obviously know a lot more about their product than I do I don't know a great deal about how magic Works anything yeah don't I don't I'm not so I'm not going to I'm not going to speculate would I do it the same way as them I like the way we've done it because fundamentally like we focus on the act of software engineering and what that looks like and showing models how to do that fundamentally the underlying model that we use is kind of null to us like so long as it's the best one I don't mind and the context Windows we've already seen you can get Transformers to have like million one and a half million token context windows and that works perfectly well so like as soon as you can fine tune Gemini 1.5 then you best be sure that Genie will will work will run a Gemini 1.5 and like we'll probably get very good performance out of that I like our approach because we can be super agile and be like oh well anthropic have just released whatever uh you know and it might have half a million tokens and it might be really smart and I can just immediately take my Jason L file and just dump it in there and suddenly Genie works on there and it can do all the the new things does anthropic have the same fine tuning support as open ey um actually haven't heard any are working on it they are partner they are partnered with AWS and it's going to be in Bedrock as far as as far as I know I think I'm I I think I think that's true um yeah we have to keep moving on to the other segments uh planning the second piece of your four-step grandmas plan that is the frontier right now you know a lot of people are talking about strawberry qar or whatever that is Monte Carlo research is current state-of-the-art planning good enough what prompts have worked I don't even know what questions to ask like what is the state of planning I think it's fairly obvious that with the foundational models like you can ask them to think by step by step and ask them to plan and stuff but that isn't enough because if you look at how those models score on these benchmarks and then they're not they're not even close to state of which ones are you referencing bench so like just like sweet bench and and so on right and like even the things that get really good scores on human Valor agents as well because they have these Loops right obviously these things can reason quote unquote but the reasoning is the model like it's constrained by the model's intelligence I'd say very crudely and what we essentially wanted to do was we still thought obviously reasoning is super important we need it to get the performance we have but we wanted the reasoning to emulate how we think about problems when we're solving them as opposed to how a model thinks about a problem when we're solving it and that was that's obviously part of like the derivation pipeline that we have when we when we when we design our data but the reasoning that the models do now and and who knows what qar whatever ends up being called looks like but certainly what I'm excited on on a small tangent to that like what I'm really excited about is when models like that come out obviously the signal in my data when I regenerate it goes up and then I can then train that model that's already better at reasoning with improved reasoning data and just like I can keep bootstrapping and keep leap frogging every single time and that is like super exciting to me because I don't I welcome like new models so much because immediately it just floats me up without having to do much work which is always nice but the state of reasoning generally I don't see it going away anytime soon I mean that's like an auto regressive model doesn't think per se and in the absence of having any thought maybe U an energy based model or something like that maybe that's what qar is who knows some sort of like high level abstract space where thought happens before tokens get produced in the absence of that for the moment I think it's it's all we have and it's going to have to be the way it works for what happens in the future we'll have to see but I think certainly it's never going to hinder performance to do it and certainly the reasoning that we see Genie do when you compare it to like if you ask gp4 to break down step-by-step and approach for the same problem at least just on a Vibe check alone looks far better two elements that I like that I didn't see in your initial video we'll see when you know this um Genie launches is a planner chat which is I can modify the plan while it's executing and then the other thing is playbooks which also from Devon where here's how I like to do a thing and I'll use markdown to specify how I do it I'm just curious if if like you know those things help yeah no absolutely we're 100% we want everything to be editable not least because it's really frustrating when it's not like if you're ever if you're ever in a situation where like this the one thing I just wish I could and you'd be right if that one thing was right and you can't change it so we're going to make everything editable including the code it wres like you can if it makes a small error in a patch so you can just change it yourself and let it continue and it will be fine yeah so yeah like those things are super important we'll be those two I'm curious once you get to writing code is most of the job done I feel like the models are so good at writing code when they're like in small chunks that are like very well instructed what's kind of the drop off in the funnel like once you get to like you got the right files and you get the right plan that's a great question because by the time this is out there'll be another BL there'll be another blog post yeah there'll be another blog post which uh contains all the information all the learnings that I delivered to open I fine tuning team when we finally got the schore oh it's okay go for it it's already out and um yeah yeah I I don't have it on my phone but basically I um broke down the log probs I basically got the average log prob for a token at every token position in the context window so imagine an xaxis from 0 to 128k and then the average log prob for each index in there as we' discussed like the way Genie Works normally is you know at the beginning you do your Rag and then you do your planning and then you do your coding and that sort of cycle continues the certainty of code writing is so much more certain than every other aspect of Genie's Loop so whatever is going on under the hood the model is really comfortable with writing code there is no doubt and it's like in in the token probabilities one slightly different thing I think to how most of these models work is at least for the most part if you ask GPT 4 in chat GPT to to edit some code for you it's going to rewrite the entire snippet for you with the changes in place we train Genie to write diffs and you know essentially patches right because it's more token efficient and that is also fundamentally we don't write patch patches as humans but it's like what the result of what we do is a patch right when Genie writes code I don't know how much it's leaning on the pre-training like code writing Corpus because obviously it's just read code files there it's obviously probably read a lot of patches but I would wager it's probably read more code files and it has patches so it's probably leaning on a different part of its brain is my speculation I have no proof for this so I think the discipline of writing code is slightly different but certainly is its most comfortable State when it's writing code once so once you get to that point so long as you're not too deep into the context window another thing that I'll bring up in that in that blog post is um performance of Genie over the length of the context window degrades fairly linearly so actually I actually broke it down by probability of solving a sbench issue given the number of tokens of the context window it's 60k it's basically 0.5 so if you go over 60k in context length you are more likely to fail than you are to succeed just based on the amount of tokens you have on the context window and when I presented that to the fine tuning team and open AI that that was super interesting to them as well and that is more of a foundational model attribute than it is an us attribute however the attention mechanism Works in in gp4 or however you know they deal with the context window at that point is you know influencing how Genie is able to form even though obviously all our all our training data is perfect right so even if like stuff is being solved in the 110,000 tokens sort of that area the training data still shows it being solved there but it's just in practice the model is funny much harder to solve stuff down that end of the context window that's the scale with the context so for a 200k context size is 100k tokens like the5 I I don't know yeah but I I I um hope not I hope you don't just take the context length and have it and then see this is the usable context length but what's been interesting is knowing that actually really digging into the data looking at the log probs looking at how it performs over over the entire window it's influenced the shortterm improve ments we've made to Genie since we did the that got that score so we actually made some small optimizations to try to make sure as best we can without like overdoing it trying to make sure that we can artificially make sort stuff sits within that sort of range because we know that's our sort of B Zone and if we go outside of that we're starting to push the limits and we're more likely to fail so just doing that sort of analysis has been super useful without actually messing with anything um like more structural and and getting more performance out of it what about um different languages so in your technical report the data mix is 21% JavaScript 21% python 14% typescript 14% TSX um which is Javascript JavaScript yeah yes like 49% JavaScript that's true typescript is so much superior but anyway do you see how good is it at just like generalizing you know if you're ring rust or C++ or whatever else it's quite different it's pretty good at generalizing um obviously I think there's 15 languages in that technical report I think that we've that we've covered the ones that we picked in the highest mix were the ones that selfishly we internally use the most and also that are I'd argue some of the most popular ones when we have more Resource as a company and more time and you know once all the craziness that has just happened sort of dies down a bit we are going to you know work on that mix I'd love to see everything ideally be represented in a similar level as it is if you if you took GitHub as a data set if you took like how are the languages broken down in terms of popularity that would be my ideal data mix to start it's just that it's it's not cheap doing all this so yeah trying to have an equal amount of of Ruby and and rust and and all these different things is just at the at our current state is is not really what we're looking for there's a lot of good Ruby my G out profile you can have it all well trying for running test it sounds easy but it isn't especially when you're working in Enterprise code bases that are kind of like very hard to spin up yes how do you set that up is like how do you make a model actually understand how to run a code base which is different than writing code for the code base the model itself is not in charge of like setting up the code base and running it so Genie sits on top of GitHub and if you have CI running GitHub you have GitHub actions and stuff like that then Genie essentially makes a call out to that runs your CI sees the outputs and then like moves on making a model itself set up a repo wasn't scoped in what we wanted Genie to be able to do because for the most part like like at least most Enterprises have some sort of CI pipeline running and like a lot of if you're doing some even like a lot of hobbyist software development has some sort of like basic CI running as well and that was like the lowest hanging through approach that we took so when when Genie ships like the way it will run its own code is it will basically run your CI and it will like take the um I'm not in charge of writing this the rest of the team is but I think it's the checks API on GitHub allows you to like grab that information and throw it in the context window what's the handoff like with the person so juny you give it a task mhm and then how long are you supposed to supervise it for or are you just waiting for like the checks to eventually run and then you see how it goes like uh what does it feel like there are a couple of modes that it can run in essentially it can run in like fully headless autonomous modes to say you assign it a ticket in linear or something then it won't ask you for anything it will just go ahead and try or if you're in like the gooey on the website and you're using it then you can give it a task and it it might choose to ask you a clarifying question so like if you ask it something super broad it might just come back to you and say what does that actually mean or can you point me in the right direction for this because like our decision internally was it's going to piss people off way more if it just goes off and has and makes a completely like ruined attempt at it because it just like from day one got the wrong idea so it it can ask you a lot of questions and once it's going much like a regular PR you can leave review comments issue comments all these different things and it because you know it's been trained to be a software engineering colleague responds in actually a better way than a real colleague because it's less snarky and less high and mighty and also the amount of filtering has to do for lgtm when you train a model to like be a software engineer essentially it's like you can just do anything it's like yeah looks good to me bro ship it I just wanted to dive in a little bit more on your experience with the fine tuning team John Allard was publicly sort of fre commentary supportive and you know was was part of it like what is it like working with them I also picked up that you initially started to fine tune what was publicly available the 16 to 32k range you got access to do more than that you've also trained on billions of tokens instead of the usual Millions range just like take us through that fine-tuning journey and any advice that you may have it's been so cool and this will be public by the time this goes out like open ey themselves have said we are pushing the boundaries of what is possible with fine tuning like we are right on the edge and like we are working genuinely working with them figuring out how stuff works what works what doesn't work because no one's doing no one else is doing what what we're doing they have found what we've been working on Super interesting which is why they they've allowed us to do so much like interesting stuff working with John I mean I had a really good conversation with John yesterday we we had a little brainstorm after the video we shot and one of the thing you mentioned the billions of tokens one of the things we've noticed and it's actually a very interesting problem for them as well when you're building like a self serve fine tuning API they have to decide how big your P adapter your adapter is going to be in some way and like figuring that out is actually a really interesting problem because if you make it too big and because they support data sets that are so small you can put like 20 examples through or something like that like if you had a really sparse large adapter you're not going to get any signal in that at all so they have to dynamically size these things and there is an upper bound and actually we use models that are larger than what's publicly available it's not publicly available yet but when this goes out it will be but we have larger lore adapters available to us just cuz the amount of data that we're pumping through it and at that point you start seeing really interesting other things like you have to change your learning rate schedule and do all these different things that you don't have to do when you're on the smaller end of things so working with that team is such a privilege because obviously they're like at the top of their field in you know in the fine tuning space so we as we learn stuff they're learning stuff and one of the things that I think really catalyze this relationship is when we first started working on Genie like I delivered them a presentation which will eventually become the blog post that you'll love to read soon the information I gave them there I think is what showed them like oh wow okay these guys are really like pushing the boundaries of what we can do here and truthfully our data set we view our data set right now is very small it's like the minimum that we're able to afford literally afford right now to be able to produce a product like this and it's only going to get bigger so yesterday while I was in their offices I was basically so we were planning we were like okay how this is where we're going in the next 6 to 12 months like we're putting our foot on the gas here because this clearly works like I've demonstrated this is a good you know the best approach so far and I want to see where it can go I want to see what the scaling law is like for the data and at the moment like it's hard to figure that out because you don't know when you're running into like saturating a PFT adapter as opposed to actually like is this the model's limit like where is that so finding all that stuff out is the work we're actively doing with them and yeah it's it's going to get more and more collaborative over the next few weeks as we as we explore like larger adapters pre-training extension different things like that awesome I also wanted to talk briefly about these synthetic data process um one of your core insights was that the vast majority of the time the code that is published by human is is in a working State and actually you need to F tune on nonworking code yes so just yeah take us through that inspiration how many rounds uh did you did you do it might it might be generous to say that the vast majority of code is in a working State I don't know if I like that's very nice if you say that my code Works certainly it's not true for me um no I think that so so yeah no but it it was you're right it's an interesting problem and and what we saw was when we didn't do that obviously we were just you have to basically like oneshot the answer cuz after that it's like well I've never seen iteration before how am I supposed to figure out how this works so so what the um what you're alluding to there is like the self-improvement loop that we started working on and that was in sort of two parts we we synthetically generated runtime errors where we would intentionally mess with the as to make stuff not work or index out of bounds or refer to a variable that doesn't exist or errors that the foundational models just make sometimes that you can't really avoid you can't expect it to be perfect so we threw some of those in with a with a with a probability of happening and on the self-improvement side I spoke about this in the in the blog post essentially the idea is that you generate your data in sort of batches first batch is like perfect like one examp like here's the problem here's the answer GO train the model on it and then for the second batch you then take the model you trained before that can look like one commit into the future and then you let it have the first attempt at solving the problem and hopefully it gets it wrong and if it gets it wrong then you have like okay now the code base is in this incorrect state but I know what the correct state is so I can do some diffing essentially to figure out how do I get the state that it's in now to the state that I want it in and then you can train the model to then produce that diff next and so on and so on and so on so the model can then learn and also reason as to why it needs to make these changes be able to learn how to like learn like solve problems iteratively and learn from its mistakes and stuff like that and you pick the size of the data set just based on how much money you could spend generating it maybe you think you could just make more and get better result what multiple of my monthly burn do I want spend doing this yeah basically it was it was very much related to yeah just like capital and um yes with any luck that that will be alleviated soon very soon yeah yeah I like drawing references to other things that are happening in in the in the wild so cuz we only get to release this podcast once a week the Llama 3 paper also had some really interesting thoughts on synthetic data for code I don't know if you haveed that U I'll highlight the the back translation section because one of your data set focuses is updating documentation I think that translation between natural language English versus code and back and forth I think it's actually a really ripe source of synthetic data and Lama 3 specifically called out that that they trained on that yeah U we should have gone more into that in our podcast with them but we uh we didn't we didn't know but uh there's a lot of interesting work on synthetic data stuff we do have to wrap up soon but I'm going to briefly touch on the submission process for sbench so you have a 30% state-ofthe-art sweet bench result but it's not on the leaderboard because the submission issues I don't know if you want to comment on on like that stuff versus uh you know we also have like a we also want to talk about sweet bench verified um yeah just anything on the benchmarking side the Potted history of this is is is quite simple actually s bench up until I want to say two weeks ago but it might be less than that or more than that but I think two weeks ago suddenly started mandating what they call trajectories when you submit so prior to this essentially when you run sbench you run it through their harness and out the other end you get a report. Json which is like here's how many I resolved here's how many I didn't resolve these are the IDS the ones I did these ones the IDS I didn't and it gives you any ones that like might have errored or something like that and what you would submit would be all of your model patches that you outputed and that report and then you would like PR that into the sbench repo and that would be it that was the still the case when we made our submission on whatever day it was they look at them every Monday we submitted it at some point during the week I want to say it was four four days before that and um I sort of like sat back and waited I assumed it would be fine when it came to Monday um they then said actually no we want model trajectories and I was like okay let me see what this is and and so on I sort of dug into it and like model trajectories are essent the context window or like the reasoning process of like show your working how did you get here if you do a math exam show me your working whereas before they were like just give me the final answer now they want to see the working which I I completely understand why they want to see that like sbench fundamentally is an academic research project and it they want all the stuff to be open source and public so people can learn from each other and improve and so on and that that's very good I completely agree however at least for us and the reason that we're not on the leaderboard is that obviously the model outputs that we generate are sort of a mirror of our training data set right like you train the model to do a certain thing and output a certain way whatever your output looks like your training data for the moment as a close Source company like fighting for an edge we've decided not to publish that information for that exact reason I don't want someone basically taking my trages and then taking a mod suing them a ga and just distilling it immediately and then having Genie for themselves and you know as a business owner that's the decision I've had to make the patches are still public so like the dare I say Trad traditional swe bench submission you can go to or GitHub repo and see it and run them for yourself and verify that the numbers come out correctly like that is all that is the Potted reason why that's the story uh verified you have a score I do have a score I do have a score 43.8% it's one of those things where like there aren't that many people on the leaderboard yet so you don't know how good or bad that is it's a smaller data set right oh it's it's great so on a tangent sbench original sbench was 2,294 which is expensive it's like $88,000 to run oh that's that's cheap cheap I I know at least for us I I don't even want to say publicly how much it cost how much it cost us to run that thing expensive slow really like crap for iteration because like you know you make a change to your model how does it do on sweet bench I guess that's why sweet bench light existed but sweet bench light was not a it was there was easy stuff right it wasn't a comprehensive measure of the overall thing so we actually had the idea a month ago to what we were going to call S bench small where we were going to try to map out across sbench like what is the distribution of like problem difficulty and all these different things and tried to come up with like 300 examples that sort of map that where given a score on S bench more you could then predict your s bench large score and sort of go from there fortunately open AI did that for us and probably much better than we would have done they use some human labelers and as obviously we're working with with open AI quite closely they talked to us about it and they um you know were able to let us know what the instance ID were IDs were that were in the the new sbench version and then as soon as I had that I could just take the report from the one that I run and just diff them and I was like oh we got 219 out of 500 which is 43.8% which is to my knowledge at least right now state-ofthe-art also which makes sense but also gbt 40 gets I believe 33% which is like I double check that but I the August one the the new one yeah it's in their blog post I I can't remember which one it was I don't know what the model version was but gbt 4 I believe gets 33% which is obviously like significantly better than what it got on the um original like swe swe 2% yeah exactly exactly it's something ridiculously though but no sweet bench verified like it's so good it's like it's smaller we know that the problems are solvable it's not going to cost me a lot of money to run it it keeps my iteration time you know lower and there are also some things that we're going to start to do internally when we run sbench to have more of an idea of how right our model is so one of the things I was talking to John about yesterday was sweet benches a pass or fail right like you you either have solved the problem or you haven't that is quite sparse like it doesn't give you a huge amount of information because your model could have got a lot of it right like looking through when you do a math paper you could have got the re you know you're working right until like the penultimate step and then you get it wrong so we're going to look into ways of measuring okay well your model got it right up to this line and then it diverged um and that's super easy to do because obviously you know the correct state of all those questions so I think one of the ways we're going to keep improving Genie is by going more in depth and saying okay for the ones that failed was it right at any point where did it go wrong how did it go wrong and then sort of trying to triage those sorts of issues so future plans you have mentioned context extending an open source model but basically I think you know what the genie is is basically this like proprietary fine tune data set and process and software that uh you can add on to any model is that the plan that's that that's the the next year it's going to just be doing that we're going to we're going to get really we're going to be the best in the world at doing that um and continue being the best in the world at doing that and throwing it as many models as we can um seeing what the performance is like and seeing what things improve performance in what places um and also making the data set larger is like one of the biggest things that we're going to be working on I think one of the decisions before you as as a CEO is how much you have like the house model be like the one true thing and then how much spend time working on customer models that's the thing that really that gets me so excited genuinely like we have a version of Genie that we named after one of our employees it's called the John uh we have a version of Genie that is fine-tuned on our codebase so we basically it's the basic base Genie and then we run the same data pipeline that we run on like all the stuff that we do to generate the main data set on our repo and then all of a sudden you have like something that is both very good at software engineering but is also extremely good at your repo and that is phenomenal to use like it's really cool more brole outside of cosign what are you seeing what what trends are you uh seeing that you're really excited by who's doing great work that you want to call out the one of the one of the ones that I mean it's it's not an original choice but curser are absolutely killing it all the employees at coine love using it and it's a really really good example of like just getting like ux right basically like the it putting the llm in the right place and letting it allow you and getting out of the way when you don't want it there and making it familiar cuz it's still vs code and all these things they've yeah they've done an amazing job and I think they just raised around so congrats on that to them so like they're doing amazing work the decision to Fork vs code I think was controversial you guys started as a vs code extension many many many people did that and they did the one thing that no one wanted I commend The Bravery honestly like I commend The Bravery cuz like in hindsight obviously it's paid off but at least for me in the moment I was one of those people being like is that going to are people going to do that are people going to download that and yes obviously they are like sure doing the hard thing which is having worked on Genie recent you know for the past eight months or whatever as taxing as it's been on us like one of the main things I have learned from this is like no matter how small you are how much resource you are just like try to do the hard thing because it I think it has the biggest payoff more broadly just like uh lessons that you've learned running your company oh it's been a two it's been a twoyear journey two-year Journey um I mean it's better than any real job you could ever get like I feel so lucky to be working in this area like especially you know it was so validating to hear it from the guys at open a as well telling us like we're on The Cutting Edge on the B we're pushing the boundaries of what's possible with what we're doing because like I get to do I get to be paid to do this you know I I have briefly as you heard at the beginning done real jobs and and normal stuff and like just being able to do this on The Daily is so interesting and so cool it's like I pinch myself a lot genuinely about the fact that I can do this and also that not only I can do this but fortunately being a co-founder of the company I have a huge amount of say as to where we go next and that is a big responsibility but is also so exciting to me because I'm like you know steering this ship is has been really interesting so far and I like to think that we've got it right you know in the last in the last sort of eight months or so uh and that this is like really the starting point of something massive to come awesome call to action uh I assume you're hiring I assume you're also looking for customers what's the ideal customer ideal employee on the customer side honestly people who are just willing to try something new like the genux is is different to a conventional IDE give it a chance like that we we really do believe in this whole idea of like developers work is going to be abstracted you know levels higher than just the code we still let you touch the code we still want you to dive into the code if you need to but fundamentally we think that if you're trying to offload the coding to a model the model should do the coding and you should be in charge of guiding the model so people who are willing to give something you a chance size of company and honestly well preferably the languages that are the most represented in our in our training days so like any if you're like doing typescript JavaScript python Java that sort of thing and in terms of size of company like so long as you're willing to try it um and there aren't any massive like infc things that get in the way like it doesn't really matter like code base size can be arbitrary for us we can deal with any codebase size and essentially any language but your mileage may vary but for the most part like anyone who's willing to give it a try is the ideal customer and on the employee honestly we just want people who um we're going to be hiring both on like what we call like tra like the traditional Tech side so like building the product essentially and also hiring really heavily on the AI machine learning um data set side as well and in both cases essentially what we just wanted like really passionate people who are obsessed with something and are really passionate about something and are willing to it sounds so corny but like join us in what we're trying to do like we have a very big ambition and we're biting off a very large problem here and people who can look at what we've done so far I'm been like wow that's really impressive I want to do that kind of work I want to be pushing the boundaries I want to be dealing with experimental stuff all the time and but at the same time be putting it in people's hands and shipping it to people and so on so if that sounds you know amable to anyone that's the kind of person we're looking to apply excellent any last words any Trump Impressions that you did you like the Trump impression yeah everyone loved the Trump impression yeah I mean it's funny cuz like I I I I have some bloopers I'll show you the bloopers after we finish recording I'll probably tweet them at some point the initial cut of that video had me doing a trump impression I sort of sat down into the chair and been like Coan is the most tremendous air lab in the world unbeliev I walked in here and I said wow this is an amazing lab and like we sent it to some of our friends and they were like nah you can't cold open with Trump man you just can't like no one knows who you are end but you can end with it now that that has gone out we can now um we can now post the rest of the bloopers which are essentially me just like fluffing my lines the entire time and screaming at my co-founder out of frustration so well it was very well executed uh actually very few people do the C video that you did I'm as a sort of developer relations person I'm actually excited by that stuff but um well thank you for coming on very very short notice I hope you have a safe flight back and excited to see the the full launch um I think this is a super fruitful area and uh congrats on your launch thank you so much for having me cheers [Music]

Original Description

Meet Cosine’s Genie: https://www.latent.space/p/cosine SWE-Bench has been the most successful agent benchmark of the year, receiving honors at ICLR (our interview here) and recently being verified by OpenAI. Cognition (Devin) was valued at $2b after reaching 14% on it. So it is very, very big news when a new agent appears to beat all other solutions, by a lot: While this number is self reported, it seems to be corroborated by OpenAI, who also award it clear highest marks on SWE-Bench verified: The secret is GPT-4o finetuning on billions of tokens of synthetic data. Finetuning: As OpenAI says: Genie is powered by a fine-tuned GPT-4o model trained on examples of real software engineers at work, enabling the model to learn to respond in a specific way. The model was also trained to be able to output in specific formats, such as patches that could be committed easily to codebases. Due to the scale of Cosine’s finetuning, OpenAI worked closely with them to figure out the size of the LoRA: “They have to decide how big your LoRA adapter is going to be… because if you had a really sparse, large adapter, you’re not going to get any signal in that at all. So they have to dynamically size these things.” Synthetic data: we need to finetune on the process of making code work instead of only training on working code. “…we synthetically generated runtime errors. Where we would intentionally mess with the AST to make stuff not work, or index out of bounds, or refer to a variable that doesn't exist, or errors that the foundational models just make sometimes that you can't really avoid, you can't expect it to be perfect.” Genie also has a 4 stage workflow with the standard LLM OS tooling stack that lets it solve problems iteratively. Timestamps [00:00:00] Alistair and Cosine intro [00:11:34] GPT4o finetuning [00:15:18] Genie Data Mix [00:18:09] Customizing for Customers [00:20:37] Genie Workflow [00:22:41] Code Retrieval [00:30:20] Planning [00:37:29] Language Mix [00:3

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Latent Space · Latent Space · 42 of 60

← Previous Next →

Ep 18: Petaflops to the People — with George Hotz of tinycorp

Ep 18: Petaflops to the People — with George Hotz of tinycorp

FlashAttention-2: Making Transformers 800% faster AND exact

FlashAttention-2: Making Transformers 800% faster AND exact

RWKV: Reinventing RNNs for the Transformer Era

RWKV: Reinventing RNNs for the Transformer Era

Generating your AI Media Empire - with Youssef Rizk of Wondercraft.ai

Generating your AI Media Empire - with Youssef Rizk of Wondercraft.ai

RAG is a hack - with Jerry Liu of LlamaIndex

RAG is a hack - with Jerry Liu of LlamaIndex

The End of Finetuning — with Jeremy Howard of Fast.ai

The End of Finetuning — with Jeremy Howard of Fast.ai

Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

Powering your Copilot for Data - with Artem Keydunov from Cube.dev

Powering your Copilot for Data - with Artem Keydunov from Cube.dev

Beating GPT-4 with Open Source Models - with Michael Royzen of Phind

Beating GPT-4 with Open Source Models - with Michael Royzen of Phind

The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis

The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis

The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of SourceGraph

The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of SourceGraph

The AI-First Graphics Editor - with Suhail Doshi of Playground AI

The AI-First Graphics Editor - with Suhail Doshi of Playground AI

The Accidental AI Canvas - with Steve Ruiz of tldraw

The Accidental AI Canvas - with Steve Ruiz of tldraw

The Origin and Future of RLHF: the secret ingredient for ChatGPT - with Nathan Lambert

The Origin and Future of RLHF: the secret ingredient for ChatGPT - with Nathan Lambert

The Four Wars of the AI Stack - Dec 2023 Recap

The Four Wars of the AI Stack - Dec 2023 Recap

The State of AI in production — with David Hsu of Retool

The State of AI in production — with David Hsu of Retool

Building an open AI company - with Ce and Vipul of Together AI

Building an open AI company - with Ce and Vipul of Together AI

Truly Serverless Infra for AI Engineers - with Erik Bernhardsson of Modal

Truly Serverless Infra for AI Engineers - with Erik Bernhardsson of Modal

A Brief History of the Open Source AI Hacker - with Ben Firshman of Replicate

A Brief History of the Open Source AI Hacker - with Ben Firshman of Replicate

Open Source AI is AI we can Trust — with Soumith Chintala of Meta AI

Open Source AI is AI we can Trust — with Soumith Chintala of Meta AI

Making Transformers Sing - with Mikey Shulman of Suno

Making Transformers Sing - with Mikey Shulman of Suno

A Comprehensive Overview of Large Language Models - Latent Space Paper Club

A Comprehensive Overview of Large Language Models - Latent Space Paper Club

Why Google failed to make GPT-3 -- with David Luan of Adept

Why Google failed to make GPT-3 -- with David Luan of Adept

Personal AI Meetup - Bee, BasedHardware, LangChain LangFriend, Deepgram EmilyAI

Personal AI Meetup - Bee, BasedHardware, LangChain LangFriend, Deepgram EmilyAI

Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit

Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit

Breaking down the OG GPT Paper by Alec Radford

Breaking down the OG GPT Paper by Alec Radford

High Agency Pydantic over VC Backed Frameworks — with Jason Liu of Instructor

High Agency Pydantic over VC Backed Frameworks — with Jason Liu of Instructor

This World Does Not Exist — Joscha Bach, Karan Malhotra, Rob Haisfield (WorldSim, WebSim, Liquid AI)

This World Does Not Exist — Joscha Bach, Karan Malhotra, Rob Haisfield (WorldSim, WebSim, Liquid AI)

LLM Asia Paper Club Survey Round

LLM Asia Paper Club Survey Round

How to train a Million Context LLM — with Mark Huang of Gradient.ai

How to train a Million Context LLM — with Mark Huang of Gradient.ai

How AI is Eating Finance - with Mike Conover of Brightwave

How AI is Eating Finance - with Mike Conover of Brightwave

How To Hire AI Engineers (ft. James Brady and Adam Wiggins of Elicit)

How To Hire AI Engineers (ft. James Brady and Adam Wiggins of Elicit)

State of the Art: Training 70B LLMs on 10,000 H100 clusters

State of the Art: Training 70B LLMs on 10,000 H100 clusters

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka

Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI

Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI

[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models

[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models

Synthetic data + tool use for LLM improvements 🦙

Synthetic data + tool use for LLM improvements 🦙

RLHF vs SFT to break out of local maxima 📈

RLHF vs SFT to break out of local maxima 📈

The Winds of AI Winter (Q2 Four Wars of the AI Stack Recap)

The Winds of AI Winter (Q2 Four Wars of the AI Stack Recap)

Segment Anything 2: Memory + Vision = Object Permanence — with Nikhila Ravi and Joseph Nelson

Segment Anything 2: Memory + Vision = Object Permanence — with Nikhila Ravi and Joseph Nelson

Answer.ai & AI Magic with Jeremy Howard

Answer.ai & AI Magic with Jeremy Howard

Is finetuning GPT4o worth it?

Is finetuning GPT4o worth it?

Personal benchmarks vs HumanEval - with Nicholas Carlini of DeepMind

Personal benchmarks vs HumanEval - with Nicholas Carlini of DeepMind

Building AGI with OpenAI's Structured Outputs API

Building AGI with OpenAI's Structured Outputs API

Q* for model distillation 🍓

Q* for model distillation 🍓

Finetuning LoRAs on BILLIONS of tokens 🤖

Finetuning LoRAs on BILLIONS of tokens 🤖

Cursor UX team is CRACKED 💻

Cursor UX team is CRACKED 💻

Choosing the BEST OpenAI model 🏆

Choosing the BEST OpenAI model 🏆

How will OpenAI voice mode change API design?

How will OpenAI voice mode change API design?

STEALING OpenAI models data 🥷

STEALING OpenAI models data 🥷

[Paper Club] 🍓 On Reasoning: Q-STaR and Friends!

[Paper Club] 🍓 On Reasoning: Q-STaR and Friends!

[Paper Club] Writing in the Margins: Chunked Prefill KV Caching for Long Context Retrieval

[Paper Club] Writing in the Margins: Chunked Prefill KV Caching for Long Context Retrieval

The Ultimate Guide to Prompting - with Sander Schulhoff from LearnPrompting.org

The Ultimate Guide to Prompting - with Sander Schulhoff from LearnPrompting.org

llm.c's Origin and the Future of LLM Compilers - Andrej Karpathy at CUDA MODE

llm.c's Origin and the Future of LLM Compilers - Andrej Karpathy at CUDA MODE

Prompt Engineer is NOT a job 📝

Prompt Engineer is NOT a job 📝

Prompt Mining LLMs for better prompts ⛏️

Prompt Mining LLMs for better prompts ⛏️

The six pillars of few-shot prompting 🔧

The six pillars of few-shot prompting 🔧

Language Agents: From Reasoning to Acting — with Shunyu Yao of OpenAI, Harrison Chase of LangGraph

Language Agents: From Reasoning to Acting — with Shunyu Yao of OpenAI, Harrison Chase of LangGraph

[Paper Club] Who Validates the Validators? Aligning LLM-Judges with Humans (w/ Eugene Yan)

[Paper Club] Who Validates the Validators? Aligning LLM-Judges with Humans (w/ Eugene Yan)

Can you separate intelligence and knowledge?

Can you separate intelligence and knowledge?

The video teaches the importance of fine-tuning GPT-4 for software engineering tasks and demonstrates the applications of retrieval-augmented generation and fine-tuning in improving code writing and retrieval accuracy. It also highlights the need for data cleaning, alignment, and sharing for model development. By following the steps outlined in the video, viewers can learn how to fine-tune GPT-4 and develop their own LLM-based systems.

Key Takeaways

Fine-tune GPT-4 for specific tasks
Develop a proprietary fine-tune dataset and process
Utilize retrieval-augmented generation for code writing and retrieval
Clean and align data for model development
Share data with companies for model development
Use GitHub's CI pipeline to run code and grab information

💡 Fine-tuning GPT-4 can significantly improve its performance on software engineering tasks, and retrieval-augmented generation can be used to improve code writing and retrieval accuracy.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Fine-tuning LLMs

View skill →

Fine-tuning T5 LLM for Text Generation: Complete Tutorial w/ free COLAB #coding

Fine-tuning T5 LLM for Text Generation: Complete Tutorial w/ free COLAB #coding

Train image classifier using transfer learning - Fine-tuning MobileNet with Keras

Train image classifier using transfer learning - Fine-tuning MobileNet with Keras

Advanced Fine-Tuning in Rust

Advanced Fine-Tuning in Rust

GPT-4o: Fine-tune OpenAI's Multimodal Model | Live Coding & Q&A (Oct 3rd)

GPT-4o: Fine-tune OpenAI's Multimodal Model | Live Coding & Q&A (Oct 3rd)

LLM Fine-tuning: Two Crucial Tips for New Models - LLama 2

LLM Fine-tuning: Two Crucial Tips for New Models - LLama 2

SDXL LORA STYLE Training! Get THE PERFECT RESULTS!

SDXL LORA STYLE Training! Get THE PERFECT RESULTS!

Related AI Lessons

10 ChatGPT Prompts for Job Seekers: Resumes, Interviews & Career Growth

Learn how to leverage ChatGPT for job searching, resume building, and career growth with 10 actionable prompts

Medium · ChatGPT

Lost in Transcription: The Week the Machine Started Lying

Learn how Whisper AI transcription can be flawed and understand the importance of validation in AI-generated text

How We Translate 300-Page Books Using Claude Without Hitting Token Limits

Learn how to translate long documents using Claude without hitting token limits by breaking them into overlapping chunks

Dev.to · 龚旭东

Building HITL Feedback RAG: Embeddings, Retrieval, and Reranking

Learn to build a Human-in-the-Loop (HITL) Feedback RAG system using embeddings, retrieval, and reranking to improve model performance

5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems

Dave Ebbelaar (LLM Eng)