Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI
Key Takeaways
The episode covers the development of Llama 2, 3, and 4 and the path toward open-source AGI, with a focus on scaling laws, fine-tuning, synthetic data, and reinforcement learning from human feedback (RLHF). It also touches on Thomas Scialom's earlier work on BLOOM and Galactica and on how Llama 3 compares with GPT-4.
Full Transcript
Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Hey, and today we have a very special episode with Thomas Scialom. I don't know how to describe you. You've done so much work in a very short amount of time at Meta, but you were most notably leading Llama 2, and today we're also coordinating on the release of Llama 3. So welcome. Thanks for having me. To be clear, obviously, the Llama 3 405B: is that the official size number that we're going with, or do we just say 400B? For the text model only, yes. A few additional parameters for the multimodal version that will come later. Awesome, awesome. Just to quickly go over your background: we actually had a slightly similar path. I was also a quantitative trader, and it looks like you did five years in quant finance, working on a trading team at SocGen, and then you transitioned into natural language, getting a PhD at Sorbonne, working with reciTAL as well, and then right after your PhD joining Meta. No, it's exactly that. But basically, I think it was at the AlphaGo moment, where I was doing some trading. I said, I need to understand what's the technology behind that, and I wanted to study machine learning. I first did some training, like a six-month executive degree, at the end of which I knew what XGBoost was at the time, and nothing about deep learning at all. And most of the people around were PhD people, and, well, okay, a PhD seems pretty cool, deep learning seems pretty cool, so I want to do a PhD in deep learning. That's where I joined. We have this PhD program in France between a company and academia, and so I did my PhD with reciTAL and Sorbonne University on natural language generation and reinforcement learning. I guess it was a good topic. I was not a visionary, it was really random. That's the company that offered me this topic, and I started something like two weeks before BERT. Excellent timing. Yeah, we actually also just released
our episode with Clémentine Fourrier, who also did her PhD with a company in a very similar format. I think, yeah, very underrated, this sort of PhD with industry expertise, because you're also publishing papers the whole time. I looked at your publishing history: you were doing summarization work, you were doing factual consistency work, you released some benchmarks, and then you worked on language GANs before the transformer took over. We can come back to that later, but I mean, those papers have like 10 or 50 citations. I'm pretty sure that if I had called them RLHF without a human in the loop, with a discriminator as a synthetic human in the loop, I would have gotten many more citations today, because all the inspiration for those papers came from the original OpenAI paper on RLHF. But in academia we don't have a way to pay for annotation online like that, so how do you simulate it? Yeah, a lot of these ideas are repeated, like discriminator and generator. We just call them different names now, like verifier, whatever. Well, I think your progress into NLP was really strong, because the first thing you worked on at Meta was BLOOM. Yeah, actually, I started to work on that before joining Meta. I was not one of the main contributors, but it was at the intersection of multilinguality, which was very important to me, and large language modeling. And that's why, actually, my first big project at Meta, and the team I was working on, was Galactica. An interesting step back from BLOOM: we made a lot of mistakes, but that's expected, and again we learned a lot. But trying to scale towards multilinguality, in fact we learned later that multilinguality almost emerges naturally with very, very little data, which was really surprising and not expected at all for us at the time. I mean, my learning from that is just that there's a natural harmony of language that is abstracted from English: when you learn English,
you learn language, and then language just translates to other forms of languages, especially if they're in the same family, right? Yeah. So maybe we should get right into Llama 2, spend a little bit of time there, and then we'll go into Llama 3. So what is the story of Llama 2 from your point of view? Yeah, so as I was saying, I started at Meta on Galactica. That was one of the first large language models at Meta, the language model for science. We released it in, I think, December or the end of November, I don't remember, a year and a half ago. I don't know if people remember, but it was huge on Twitter, both with people thinking it's the end of science, pointing at a lot of hallucinations, and with people saying it's super awesome. I still think it was super awesome, but we didn't do instruction tuning or RLHF techniques at the time. It was a weird moment, because two weeks later ChatGPT came out, and that's the moment where I think all the FAANG companies went upside down, and where we had a huge push from leadership to now work on that and make a ChatGPT as soon as possible. So we had this one or two months of figuring out what to do. I was actually working on Galactica Instruct, which you could connect to Overleaf. We had a partnership with Overleaf, the Google Docs of scientists, where you can write papers, and you write there in LaTeX, where you have to do a lot of citations. So the idea was that you could, just like with ChatGPT or InstructGPT, ask it to swap two columns in a LaTeX table, which is something very, very time-consuming, I can promise. You could say, oh, find me a citation about LLMs and bias, and it would find you some papers and automatically insert the bib entry in LaTeX. So that was pretty cool, but because of the backlash we never opened it in the end. Oh, because of the Galactica backlash? Oh yeah, yes. I was just saying, today it's still not solved, because Lucas Beyer is still asking for this citation generator. I saw this tweet and I was like, dude, we had that two years ago, and I
promise I tested it. It works so well. I had it integrated in Overleaf, I tested it. Wow. Yeah, no, it went quite far, in fact. And actually, about citations, it's anecdotal, but the way Galactica was trained to cite papers, with all the references in the papers, is what made it emerge so easily at instruction tuning. Galactica Instruct was actually the first annotation project for RLHF at Meta. That was a follow-up to Galactica that we were preparing, and at the same time my friends from the Paris office created Llama 1. To connect the dots with what we said before: the last author was Guillaume Lample, who founded Mistral, and the first author is Hugo Touvron, who worked with me on Llama 2 and is still at Meta, and both did a PhD program between Meta as a company and academia. So that's a pretty good program indeed. And so we worked on Llama 2 from that point. We had the support of the company leadership; that was one of the main priorities. We had Llama 1 and Galactica as the backbone of a good language model. We started from Llama 1, and I worked mainly with Hugo on how to make instruction-following and chat models that will follow instructions, so all the supervised fine-tuning stage and then the RLHF. There were some papers, so we had some intuition from there that we could use. But in fact, at large scale, and that was probably the biggest challenge for us, there's no research anymore. We didn't know how much to scale. Can you describe what scale you're talking about? Yeah, to what level to scale the annotation. Do you need 100,000, 1 million, 10 million annotations of supervised fine-tuning, of preference? We had no idea. What is the actual algorithm to use? How often do you retrain the models? You have just the basics, but when it comes to ChatGPT or InstructGPT or Claude, no one published the details there, and so we had to reinvent the wheel in a very short amount of time. And what about parameter size? This is one question that a lot of folks had about Llama 3. So with Llama 1 you had 7B, 13B, 33B, 65B model sizes, and then Llama 2: 7, 13,
70. How do you evaluate what's worth training, especially when you think about data? Maybe 100,000 is enough for a 7B model, but it's not enough for a 70B model. How do you decide model size, especially when you're maybe annotation-constrained on some of these things? That's a very good question, and there's no good answer. There are so many parameters to take into account, from the scaling laws at training time to get the best performance, to the GPU constraints and the different hardware. We think about Meta, but also about the community, and people are not just using H100s; there are A100s, there are different sizes of GPU memory. So which size will fit on what, and what is the most useful, also at inference time, not just at fine-tuning time? Then you can maybe do some tricks at inference time to quantize it a bit, to FP16 or FP8 now. All those constraints make it very, very challenging. At inference time you have a lot of costs, so how to trade off between inference cost and training cost is a very challenging problem. In general, we tend to think, in particular for Llama 3... Llama 2, maybe I would say, was like Llama 1: we had a flagship model which was 70B, also because the project was taking some routes to reproducing Chinchilla, which was a 70B. For Llama 3 we also moved one size up, with the flagship model at 405B. I think there was also the question of: we want a model by this time, we have this amount of compute, so given the scaling laws and the number of tokens we have to train it on, what will be the right balance that still fits in a finite time? So we try to make some trade-offs like that. Yeah, and you mentioned Chinchilla is the best way to go, but then you tweeted recently, don't fall into the Chinchilla trap if you want your model to be used by billions of people. So what's the updated state of scaling laws? I think there was obviously the Kaplan paper, and then there was Chinchilla, and then people kind of called out the Llama scaling law, like the 100 to 200x kind of parameter
to token ratio. What's your updated thinking on how to think about scaling laws when you pick model size and training data? Right, so, as you said, there was this Kaplan paper with scaling laws. They tried two dimensions, the model weights and the amount of training, like the number of steps, training tokens, epochs, and from that they figured out that model size is what matters. So GPT-3 was way too big compared to the actual number of training tokens, because they made a mistake in not adapting the scheduler. That's what Chinchilla fixed and discovered. To be fair, I think OpenAI knew that by the time of the Chinchilla paper. But yeah, basically Chinchilla said we have to revisit the scaling laws originally published by Kaplan and emphasize much more the importance of training tokens. They did some really good scaling laws showing that there's an optimum: basically, you need to double the number of training tokens every time you double the model weights to get an optimal ratio, so that for a finite amount of compute you end up with the best results in your paper. And what I call the Chinchilla trap is that that's good if you want the best flagship model that obtains the highest performance in your paper. But if you want to use your model at inference time, of the two dimensions, one remains, the model weights, but one drops out: the number of tokens you trained it on, the number of steps. So to be compute-efficient at inference time, it's much better to train it much longer at training time, even if that's an additional effort, than to have a bigger model. That's what I refer to as the Chinchilla trap. Not that Chinchilla was wrong, but if you consider inference time, you need to go beyond Chinchilla. And in fact, that's what Llama 1 first did by overtraining, in the sense that they could have gotten better performance in the paper, but they preferred to create the best artifact that would be used by the community. So that's the scaling thinking. What else went into Llama 3 planning? You know,
so with Llama 2 you had a pretty good model, people really liked it. In Llama 3 you dropped the intermediate weights, so it's 8, 70, and now 405B. What was the thinking about going so large? I mean, you talked about the hardware capabilities at inference. I cannot run a 405B model at home, for sure, and it might be hard to even get the cloud resources to do it. What was the decision there? The decision is super simple: we want the best model. We want to be number one. And number two, we started a year and a half ago and we have come quite a long way. We filled the gap with GPT-4, so that will be the first open-source model that actually compares to GPT-4. There's now GPT-4o, of course, and we're close, but we're not there yet, not in all capabilities, but the gap is getting smaller and smaller. There's also what compute we had at the time when we started the run in January. We put a lot of effort there, but as Mark announced, we have more and more GPUs, so the next generation will be bigger. So that's what drives the decision. Now, maybe let me reflect on two things you said. You cannot use it at home: that's probably true, but quantized to FP8 it can run on a node, even with a long context of 128K tokens. The second thing is, I'm hopeful that by open-sourcing it the community will come up with a lot of findings and very smart ways to actually make you use it on your computer. If you remember Llama 1 and Llama 2, when we published the models people were saying it's too big, and after two weeks it was running on a Raspberry Pi. I don't know if it will be the same, but I hope for the same kind of trend, and by releasing those models we are enabling that. Now, the last thing I want to add is that having bigger models enables us to collect better data, for instance at the RLHF stage, because that's the model we use for the annotation, and so we distill those annotations straightforwardly from this better model to the other models. So I can guarantee you that the quality of the smaller models we are releasing with Llama 3 is also thanks to having
these artifacts, with which we can collect data and train. Yeah, there's a lot of really good stuff there. One thing I'll just briefly touch on, for quantization: there was a recent Noam Shazeer blog post. Noam is writing again, for some reason, and he was talking about native FP8 training. It seems like that is most useful for inference, and that is what you expect the open-source community to do with your weights once you release them anyway. Is there any movement or thinking about just moving to FP8, or whatever other new format is in vogue these days? So there are these papers on training, I forget the name, but there are two follow-up papers on just 0, 1, or -1 weights, and there's a lot of work there. I think it's a promising direction overall. Regarding FP8 in particular, there's a possibility for the community to try FP8 or other methods that are very easy at fine-tuning time for the model, so I'm looking forward to what the community can build there. Overall, on scaling, I don't know if it's all you need, but I would not bet against scaling, and one of the ways to get more scale is by having a better algorithm with which we can train to the same level for less compute. Less compute and less memory, yeah. Inference-time memory is becoming a real constraint. Yeah, but also training with FP8: if you unlock training with FP8 or, I mean, FP0 is probably nonsense, but to what extent, how far can we go? And every step you unlock, compared to what we had two or three years ago with FP32 or FP64, is huge progress in terms of scaling. For me, it's interesting to see you mention the ternary quantization, the 1.58-bit thing, because I didn't know that, and I don't know how much to believe it. There are a lot of these kinds of papers where it makes a lot of noise, but it doesn't actually pan out, it doesn't scale. I totally agree with you. It's so hard for researchers, at least for me, to see all those papers published, all those cool ideas, all those results that
are preliminary, and in all that massive amount of research, to know what will scale or not, what will stand the test of time or not. And are we maybe losing some gems that people are not working on, just because there's too much research around? I don't know, maybe. Those are problems to have. It's cool to have these problems nowadays, compared to probably what Yann and the others had 30 years ago, but still, it's a problem. You know, for what it's worth, I do think that FAIR is putting out incredible research. It doesn't seem like it's your group, but you also recently published MobileLLM, which on the small model side is really good research on small model architecture, and it looks like Hugging Face is also replicating it, and it's doing quite well. There are a lot of ideas on shared weights and shared matrices and model architecture stuff that we can talk about for smaller-scale models. Llama is not at that scale, but it seems like one of the big themes of this year is on-device, in-browser small models that are good enough for daily use. I do want to talk about architecture, right? I'm not sure when you're releasing the Llama 3 research paper, but in Llama 2 you talked a little bit about the architecture choices. It will be released, I think, the day of the release. Okay. What should people know, or what are the major choices of Llama 3 versus Llama 2? There are not a lot of changes in terms of architecture. I think we can do a lot better in the future, and not just with transformers. For instance, to me it doesn't make sense to use the same amount of compute per token for every token. There's this architectural lack of flexibility, and there's a lot of research to do there, but still, that's the best thing we have for now. And so it's the same recipe as Llama 2 in terms of architecture and training, but we put so much effort into scaling the
data, and the quality of the data. It's now 15 trillion tokens compared to 2 trillion, so it's another magnitude there as well, including for the smaller models. One of the things I noticed in the paper is that you used Llama 2 to do the data cleaning for what went in there. I think there's a lot of chatter, obviously, about synthetic data, and there was the Rephrasing the Web paper that came out maybe a few months ago about using Mistral to make training data better. Any learnings from that? How much can you rewrite with the models? I'm sure people would love to hear more about it. Right, so it's a very interesting research direction: synthetic data in general, and synthetic data for pre-training. My intuition is that the web is full of [ __ ] in terms of text, and training on those tokens is a waste of compute. Just having a good classifier that labels that is cool, and Llama 2 was, at the time, before Llama 3, the best model we had legal access to for labeling the web and selecting which are the good tokens and the bad tokens. The additional thing is that it also enabled us to have a topic tag: is it about law, is it about politics, is it about chemistry, math, reasoning? So you can also adapt the mixture a bit to balance the diversity a bit more. To me, I'm not exactly sure what you guys did, but I feel like when people say synthetic data, there need to be different categories of synthetic data now, because I think there are so many different usages of this thing. But specifically with synthetic data for pre-training, it feels almost like you're running multiple epochs on the raw data while it's rephrased or reformatted by a language model, right? And in my mind it's very similar to computer vision, where you do data augmentation on an item. We're doing data augmentation; that's the less cool name for synthetic data. That's very interesting. I totally agree with you when it comes to pre-training; I totally stand by the point you made. I think it's very different, though,
for post-training, and for the future directions in synthetic data that I'm personally excited about. For instance, we had this survey on augmented LLMs a year ago, and the whole idea is that if you augment your LLM with something else, it can be a retriever, it can be search, it can be a tool, it can be a calculator, it can be code execution, then you are not just distilling, doing some data augmentation with your model, but you're actually adding some expert skills that possibly go beyond the model weights. For instance, if your model can calculate something it got wrong before, and now it has access to a calculator, and you can retrain your model on that, then you're learning something new. If your model didn't know something, say Llama 2 probably doesn't know a lot about Llama 3, but now it can search online about it, and then you train the model on that, then you have a positive feedback loop, what we call expert iteration, targeting the weaknesses of the model. It's a continual augmentation of the language model, going much beyond just data augmentation. How related is this to tool use? Are you teaching it to use tools to augment the model, or are you saying, do active learning: where it's weak, go augment the model with extra data and then memorize that new data? Right, what I said is more in terms of directions, not for Llama 3, but when it knows how to use a tool and correct itself, this is a very promising direction that goes much beyond augmentation, for the future, to keep collecting new data, new tokens. People are saying we are running out of tokens, but think about those kinds of tokens where the model always goes to correct its own weakness. It can say, okay, that's 10 + 10. Okay, that's an easy example, probably the model knows, but imagine something more complex. 10 + 10, I expect it to be 20, let's verify with a calculator, which is easy for a basic agent now, powered by an LLM. And then you verify with respect to
what you expected that it's correct. If it's not, you can backpropagate those examples directly into the weights, and so the model will keep learning new things. It makes sense. What have been your insights? You mentioned using calculators. I think in general a lot of that is just driven by code generation, apart from just tool use. What have been your insights on the data mix, on how much code, how much multilinguality, which is something that you're also passionate about? We know that that's changed between Llama 2 and Llama 3. Is it changing for different stages, or between the different sizes of Llama 3? Anything of that sort? No, it didn't change for the different sizes; we used mostly the same mix. What happened is that we changed the data mix during the training of Llama 3, with some findings that came along the way. I mean, training is long, so you have to do something while it's training. I was working on my side mostly on post-training, but the pre-training team did quite a lot of work to get some new findings and improve the data mixture along the way, and they introduced them before the end of the training. I sense a movement in terms of the curriculum that people are adopting during pre-training and even post-training, about what the mix should be. Snowflake is doing some interesting work with enterprise intelligence, or whatever they call it. What are your goals when post-training, just at a high level? How do you work with the pre-training team? I think it's quite easy for now, because there's not yet this kind of continual augmentation where it could feed back into pre-training. One of the big continuums between pre-training and post-training in particular is continual pre-training, where you actually continue the pre-training before RLHF, in a self-supervised way, but on expert-level domains, to have an expert in code, an expert in reasoning, or an
expert in multilinguality, which enables you to collect even better annotations afterwards. So that's one thing, and then you start from this model to actually do the RLHF stage. And about your question on goals: the goal was to get the best model in all dimensions. That's actually one thing I can comment on that is very different compared to Llama 2. With Llama 2, as I said, we were nowhere. We built the whole stack entirely end to end, from data annotation, contracts, methodology, and protocol to the algorithms for RLHF at Meta, and we had to limit our scope. There were also not a lot of us working on that. We focused mainly on helpfulness and following instructions for Llama 2, and you can see that in the months following Llama 2, a lot of open-source models came out, mainly distilling GPT-4, but obtaining better reasoning, math, and coding chat models, and we didn't annotate at all for code, nor for reasoning or multilinguality. And one thing I'm quite proud of is that the early preview release of Llama 3 we did back in February or March, I don't remember, led quickly, instantly, to state-of-the-art results for its model sizes, almost competing with GPT-4 on the Arena leaderboard, where humans compare two models head to head and select their preference. And no one since then has been able to put out a Llama 3 model better than what we did on most of the domains, from code to reasoning, multilinguality, and helpfulness. So that's to say that this time, as opposed to Llama 2, we tackled all those different aspects. Do you have any other thoughts on the more synthetic-data-focused models, kind of like Nemotron? I think folks were asking if you see that as an interesting direction too, having models specifically for synthetic data generation. I don't know about this model exactly, but I think Llama had better performance overall. I'm very bullish on synthetic data generation, but I think it just gets better when you have a better model. I'm not really bullish on having a model only for synthetic data generation. I understand the need for having bigger models that you can then
rationalize, and yeah, maybe people will not use them for inference, but to distill some specific knowledge into synthetic data. With that narrative, I think I totally agree. But having a model purely for that, and not good at other things, I don't think that's the case. Makes sense. One of the architecture questions that I forgot to mention in there was just the architecture choice of a very big 405B dense model. I honestly thought that maybe 175B or so was kind of the peak, whatever can fit on an H100. So basically, I think the common question that people have is: why no MoE, in the way that Mistral and the others have gone? It seems like the trend has been MoEs, and you guys have bucked the trend there. I've heard that question a lot. There are different aspects there. Why not an MoE in the future? The other thing is, I think a dense model is just one specific variation of an MoE, with basically one expert. So it's just a hyperparameter we haven't optimized a lot yet, but we have some stuff ongoing, and that's a hyperparameter we will explore in the future. Let's make sure we run through everything on post-training. You also had a recent tweet about RLHF versus imitation learning, explained in one tweet. We'll put this in the show notes, but it's basically two charts about doctors' opinions. On one side there's whether or not the suggestion is good from a content perspective, and the chatbots rank really highly while the physicians are kind of a bell curve, as you might imagine. But then in the empathy voting, most physicians are rated not empathetic or slightly empathetic, versus all the model responses being rated very empathetic, and empathetic at worst. Most people might look at it and not really get much from it, but obviously it resonated with you. Can you run people through some of the choices you make in post-training to optimize for one of the two and getting
the best responses? I think the tweet was about the intuition of why reinforcement learning from human feedback works. When we started Llama 2, I had this budget of annotations in the millions of dollars, and, okay, what to do? I'm responsible for that. I'm accountable for a model at the end that can follow instructions and compete with GPT-3.5 at the time. What to do? You can annotate supervised fine-tuning data, which means a human creates a prompt and also writes, himself, the answer expected from the model. Then you train on that in a supervised manner; that's very classic, standard fine-tuning in machine learning. The other thing is reinforcement learning from human feedback, where the annotator types a prompt, but this time you sample two different answers from your model and you ask the annotator which one he prefers, and then you train on the preference, to simplify. When you're asked to train on preferences over the model's own outputs, that seems very weird and not really robust, training on synthetic data generated by the model. So I was like, let's annotate 100,000 more of supervised fine-tuning data, and let's annotate a bit of preference to do RLHF because everyone is doing it. And then we had this human evaluation, after a few weeks in the Llama 2 project, where our model was already better than the annotations from the humans. You'd get a prompt, you check what the human annotated as an answer, you check what the model generates, and most of the time the model was better. I was like, oh, maybe the annotators are pretty bad, let's look at that. And no, the model was pretty good. And so I understood the intuition behind RLHF: these models are already super good at some tasks. With RLHF, then, imagine a distribution, a Gaussian distribution, which was basically the tweet, and you have bad outputs on the left and good outputs on the right. It's the same with medical diagnostics from a doctor: you have good outputs on the right and the bad diagnostics on the left. But
you have the distribution. Then, when you collect all the diagnostics from doctors, hopefully it's mostly on the right, a lot of the time good diagnostics, but humans make mistakes, right? So there are bad diagnostics: on the left you still have a few examples, which makes the curve of the distribution not zero there. In the same way, humans make mistakes when they annotate, and so when training with behavioral cloning to reflect humans, the model will also learn to make some mistakes, just like humans. So you will have some bad outputs from the model from time to time, reflecting humans, and you cannot go beyond that if you train on human outputs. But now, if I ask a doctor to check a sample from my model, or a sample from two doctors, one diagnostic and another diagnostic, where one is better than the other, it's easy for a doctor to say which one is better. In the same way, if I sample from my model, which learned the human distribution of answers, with a bad one from time to time, like humans, but good answers most of the time, I can ask a human to choose which one he prefers. Personally, I'm really bad at creating poems. The example I give a lot of the time: try to write a haiku, in three lines, about large language models. You take 5 seconds to think about what you could come up with, and I'm terrible. But yet, if I check two poems generated by a model or a human, I can tell which one I prefer. I'm good at discriminating. And because of that, you can have a model that flattens the bad outputs and learns to shift only towards the best, better and better outputs. And you can even end up with superhuman abilities, since I'm bad at writing a poem, but I'm good at judging which one is better, so I can actually annotate data beyond my own skills at creating them. That's the magic of RLHF. Yeah, we have one episode, RLHF 201, with Nathan Lambert from the Allen Institute, who was at Hugging Face leading RLHF before, and he mentioned that one of the things that makes RLHF work is that humans are maybe not great at creating a lot of things, but they're
usually very good at giving an opinion on which one of two they prefer, so they're able to actually annotate data for things they would never create from scratch. One question, actually, that he asked me to ask you: how much in post-training do you attribute improvement to the RLHF side versus the instruction fine-tuning side, and maybe how do you think about prioritizing the two and what areas they impact the most?
You mean between supervised fine-tuning, like supervised fine-tuning annotation, and preference annotation? Yeah. So 100% to RLHF. In fact, that's quite interesting. You start, for Llama 2, with a pre-trained model, and you have to have an instruction model, a chat model; otherwise the model just continues finishing sentences. So you need that to start RLHF, and we had to annotate like 10,000 examples. What did we do for Llama 3? You start with a new pre-trained model, and then, before starting the RLHF, you want to have a chat model that is not too bad. Option one was: let's do human annotation again, like an SFT stage. But in fact, by the principle I described before, the annotation would actually be worse than Llama 2. So what we did is that we generated all the data on the prompts with Llama 2, and we applied basically the last round of Llama 2 we had to kick off and start Llama 3 post-training. So Llama 3 post-training doesn't have any human-written answers there, basically, almost. It's just leveraging pure synthetic data from Llama 2.
Do you have an intuition on which areas work better for which? For example, you mentioned the physicians are experts. What about maybe code, or, you also have a multimodal model you're working on, so image generation: does this apply to any modality, any subject?
That's an open subject. The general intuition is that, for instance, for code, because it is factual and you can check whether the code is correct or not, RLHF is not the way to go; you would prefer to do supervised fine-tuning, with a human writing the code. But in fact, because humans make mistakes, because actually even in code
there are some preferences, that they might like that, and maybe for some other reasons that we don't know, RLHF is so much more scalable. It costs less, it's easier, and it leads in general to just better performance. And maybe we can come up with a compromise. We actually suggested teacher forcing in Llama 3, a new method that kind of fills the gap between... not teacher forcing, sorry, teacher critic. The thing is, in training the models, teacher critic is where it reconciles and unifies supervised fine-tuning and RLHF. That is, when you do human preference and you have two outputs, but both are very bad, in code for instance, you ask a human to edit the best answer to make it correct. So now you are doing SFT when all the answers were really bad, so that you can get out of the local minimum of your model. I think this is super promising, and it seems like there is just...
Well, do you have an idea? You started with this question of how much scale you need. Do you now have a better idea?
No. What we know is it's not plateauing yet.
It's not plateauing yet, yeah. So just an infinite amount more. Well, you know, Scale AI and all the annotation providers are very happy to hear that. So we mentioned at the start of the conversation the AlphaGo moment, and I feel like this is very interesting to reflect on, right? I think that one of the lessons from AlphaGo is that people thought that human interest in Go would be diminished because computers are better than humans, but then we have this sort of centaur model, where humans and computers together are actually doing better than either humans or computers would alone. And I think we're seeing that with what you're talking about, this RLHF improvement, right? We're kind of building human preference into the model, and the blending of the human preference and the model capability is actually doing better than we could on our own. I just think it's pretty fascinating. It is
fascinating. The other thing is, RLHF came from the alignment community, and I think there's a lot of perception that maybe it's due to safety concerns. But I feel like, really, over the past two or three years, it has expanded to just "this produces a better model, period", even if you are not that concerned about existential risk. I always feel like it's so interesting to see this: the people who take alignment super seriously are the first to consider superalignment, and now I'm almost thinking about this as super quality, that we are training models that are higher quality than humans, and it's not really about alignment so much as that we now see that this is actually possible. Yeah, and it's not even for alignment purposes. We just think it's better at reasoning, better at knowledge, better at everything.
Well, I don't know how much better it is yet on those, but clearly it's superhuman on some writing skills, and it's super useful. I think that's great, to be honest.
Yeah. Perhaps we can transition to evals. We had some questions about the 405B details that we want to disclose; by the time this podcast comes out, we'll have disclosed them. I think last time you disclosed the evals while you were still training. What should people know about the high-level headlines for the new Llama 3?
At a high level, it's the best open-source model ever. It's better than GPT-4. I mean, which version? But by far compared to the version originally released. Even now, I think there's maybe only the latest Claude 3.5 Sonnet and GPT-4o outperforming it, and that's it, period. So for the 405B, that's the flagship; that's a pretty good model, not yet number one, we still have a journey to get there. For the 8B and 70B, they are world-class models for their size, for general models.
And are the benchmark numbers from the initial checkpoint still right? So, the April 15 checkpoint: MMLU on instruct is like 86, GPQA 48, HumanEval 84, GSM8K 94, MATH 57.8. Is this still roughly the same performance? You know, I haven't seen the numbers yet either; we're just breaking the news right now.
No, it's roughly that.
Awesome. So, talking about evals, we just had an episode with Clementine from Hugging Face about leaderboards and arenas and evals and benchmarks and all of that. How do you think about evals during the training process? And then, when the handoff happens, do you already know exactly what you want to improve? I know that, for example, to improve an Arena score you need something different than for an MMLU score. How do you think about prioritizing the post-training improvement based on benchmarks?
That's a super hard and good question. There's no good answer. I mean, evaluation is an open research problem, in particular when you're trying to target so many capabilities. And also, as soon as you're trying to push numbers on a benchmark, it stops being a good benchmark, because then you don't know if you're overfitting it and whether it will transfer to similar capabilities. So evaluation for language models, in particular in post-training, is a very hard problem. We tackled that by playing with different methods, like reward models, evaluation with a model as a judge, having a diversity of prompts and a diversity of benchmarks as well, for a lot of different capabilities; that limits the possibility of hacking them. Of course, we also do a lot of human evaluation. I also do a lot of model testing, qualitative analysis, like testing some prompts myself. I feel it was much easier during Llama 2, when the model was worse than today. Now the models are getting so good that it's hard to find prompts that break them, to compare models and see the edge cases. So it's getting harder. A great way also to compare models is through the different rounds we have done for RLHF: every time we upload a new model, for all the annotation we are doing, we have the win rate between the previous model and the new model, by
just sampling: for every prompt we annotate for preference, sample A comes from the old model and sample B from the new model, and so we can automatically calculate a win rate.
Interesting. What are the areas where you have to work the hardest to catch up to the private models? Maybe there's not as good public data or whatnot. Or is the performance improvement just kind of even across the spectrum?
Honestly, all of them. We were behind on all of them, between Llama 2 and GPT-4. I mean, it's different challenges every time. Being good at code or reasoning is something we didn't do for Llama 2, so we had to build everything from scratch. Improving on helpfulness, which is one of the main dimensions that people look at, I think, in the Arena, which is, by the way, a very interesting evaluation, because when we did the preview (and I don't know yet what the results will be for this new Llama 3) we ended up very high on this blind-test leaderboard, and to be honest, I didn't expect that. I knew we had good results internally, but how that would transfer to perception from the community, people using it in practice and comparing it to the others, I didn't expect that positive feedback, that high Elo score on this benchmark. It doesn't say everything, as I said before. It is interesting because it's a community that creates the prompts and judges the answers; we are limited, we are not good at doing that ourselves. So it gives you a very good indicator of how good and helpful the model is on the main core of the distribution, simple prompts, and about the tone of the model compared to the others. But for much more complex prompts, much more intelligence, reasoning, coding of complex stuff, it doesn't tell the full story. You know, while we had the 70B preview at the level of GPT-4, even a bit better at the time, I think it was partly true, but clearly we were not at GPT-4 level on code or reasoning.
I know there's some conversation about the math score; apparently the next GPT, "GPT Next" or whatever, has like
reached 90, which is a big, big jump from the current state of the art. It will be interesting. Rounding out the topics on potential model areas of development and evals: Clementine is looking for a confidence estimation or uncertainty benchmark. One of our previous guests, Bryan Bischof, is also asking about how we think about evals for practical things like confidence estimation, structured output, stuff like that.
Yeah, I think we lack such evaluations. One number I was suggesting to the team, like two days ago, to report at some point is: okay, we have this accuracy on MMLU, on whatever, on MATH and GSM8K. What if we change the prompt a bit, and instead of telling the model "you have this question, you have to answer A, B, C, or D", we tell the model "you have to answer A, B, C, or D, or you don't know"? Maybe the accuracy will be a bit lower, but I'm curious to see if some models have different calibrations, where maybe model A has 50% correct and model B has 50% correct, but model A answered 100% of the questions, so 50% are not correct, while model B actually answered only 60%, so for 40% of the time it said "I don't know". I prefer model B, and we are not reflecting that in evaluations.
I think this is very relevant for post-training in particular, because it seems that the general consensus is that base models are more calibrated than post-trained models, right? Something like that.
Exactly.
That seems to be the research from OpenAI as well. I don't know the degree of this, and maybe we can invert it, right? Maybe post-training can help to increase calibration rather than decrease it. I feel like this is a little bit of being too similar to humans, because humans are not calibrated very well.
Yeah, that's a goal of post-training, I think: to make models more calibrated, to not be biased toward answering A, B, C, or D as often as possible, to follow the uniform distribution.
And on the structured output and tool calling side, do you think that it's not an explicit
part of the evals? Obviously, you worked on Toolformer and on language augmentation. Do you encourage the open source community to fine-tune Llama 3 to do tool calling, or do you want to just have that in the model from day one?
We have that from day one. Good news for the community: we are state of the art there. I think the model will be pretty good at that. We have a lot of gems about tools in the paper, but the model is fine-tuned to do tool usage and zero-shot function calling. There are some system prompts: if you tell the model to do it, it can use search and image generation, and it can do a lot of stuff like code execution as well, even in a multi-message way, so almost multi-step agents, which are kind of sparks of agents.
Okay, you talked about agents, so I guess we should probably mention the work on agent stuff. You also mentioned, in the pre-conversation, that you're already starting work on Llama 4. What do agents have to do with Llama 4? How does your work on GAIA inform all this work?
Yeah, so we published GAIA, the General Assistant benchmark, one year ago. That followed the direction I really like to pursue. I mean, everyone passionate about AI and trying to build Jarvis will go there. So I did Toolformer and this survey on augmented models. In fact, reflecting back, I was like: okay, we have Galactica, we have Llama 1, we have Toolformer, and there's GPT-3.5 at the time, and then GPT-4. If you don't have a good instruct model to follow instructions, the extension and the future of Toolformer is limited. So we need to work on that, and we did Llama 2, and now Llama 3. And it's very interesting: on the General Assistant benchmark, so GAIA, agents powered by language models perform close to zero with GPT-3.5 and at something very significant, like 30%, 40%, 60%, with GPT-4. So there's a gap of intelligence here. And I think this gap of intelligence, this threshold that you pass in terms of zero-shot function calling, following complex instructions that can span over a page of
constraints, all those things that make today's agents, with ReAct loops, pre-planning, multi-step reasoning, and function calling, work in practice, is this gap of intelligence. So now that we have Llama 3, I'll be back to agents. I expect some incremental and significant progress on pre-training and post-training, but I'm really hopeful that we can gain some order of magnitude of scaling by interconnecting models well into a more complex system that can do planning, that can do backtracking, that can take actions, navigate the web, execute code.
Okay, there's a lot there. When you say integrating world models, is there anything from JEPA? Is that something that we're talking about, or is that a different line of research?
No, not directly. It's the same goal, I would say, but JEPA is very fundamental research, which has some promising early results. And what I was looking at right now is state-of-the-art results on GAIA. There's a leaderboard, by the way; you mentioned Clementine before, she contributed to GAIA as well, and Hugging Face put a leaderboard there on the website. There are some state-of-the-art results. What is interesting is that GPT-4 alone has 0%, or like 5% I think, on level one (there are three levels of difficulty). But OS-Copilot, then AutoGen from Microsoft, and recently Hugging Face Agent obtain up to 60% on level one. So connecting an LLM to an agent that can do all those things moves new capabilities much further forward. This is kind of a breakthrough. And those systems are purely based on instruction-tuned models following instructions, where you have an orchestrator and you say to your LLM: okay, this is your task, you have access to the tools, you can navigate the web; can you make a plan of what you should do? And then: okay, that's the plan, now execute the first step. Did you manage to succeed for the first step, or do you want to rethink your plan because you entered a dead end? And you have kind of all this orchestration by system prompting, instruction following, and just that,
which is quite suboptimal, and probably you need to go later into latent space and more JEPA-style. But just that is getting us to some really impressive results already.
And do you see the planning and review as always being needed in the future? This is kind of like Andrej Karpathy's idea of more tokens equal more thinking: the more you have it write tokens and think about the outcome, the better the result you're probably going to get. Do you think that's always going to be the case, or that in the future you can just tell the model "this is the task" and it will just return the answer directly and do all of that in the latent space, so to speak?
Right. I think in the future it should hopefully go more toward "this is a task, and I return it", but we need to teach that to the model, to train that, which is far from now. A medium- to long-term direction that could be really relevant here is thinking in latent space. I know some early works are doing that, and that's probably a way to move to: first you think, and you don't have to write all the tokens, it's in your head, it doesn't have to be as constrained as plain text, and once you're done with your thoughts you can just write the final answer or take an action.
Just a commentary on that: Anthropic actually cheats at this right now. If you look at the system prompt in Claude Artifacts, they actually have a thinking section that is explicitly removed from the output. I mean, they're still spending the tokens, but that is before training; at the prompting level you can simulate this. And then at ICLR there was the pause token, the backtrack token. I feel like all these are token-level stopgap measures. I feel like it's still not the final form; we still need to have, at the architecture level, some kind of variable inference length thing that lets you actually think in latent space, like you're talking about. I don't know if you
there's any papers that you're thinking about? No, but that's interesting.
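Two of the evaluation ideas described in the conversation can be sketched in a few lines: the win rate between the previous model and the new model during preference annotation, and a benchmark score that lets the model answer "I don't know". This is only an illustrative sketch, not Meta's evaluation code. The function names, the "IDK" label, and the choice to count a tie as half a win are all assumptions made for the example.

```python
from collections import Counter


def win_rate(judgments):
    """Share of prompts where the new model's sample was preferred.

    `judgments` holds one label per annotated prompt: "new", "old", or "tie".
    Counting a tie as half a win is an assumption, not something from the talk.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to score")
    return (counts["new"] + 0.5 * counts["tie"]) / total


def abstention_report(predictions, gold, abstain="IDK"):
    """Score multiple-choice answers when "I don't know" is allowed.

    Plain accuracy cannot separate a model that guesses on every question from
    one that abstains when unsure, so the wrong-answer rate is reported too.
    """
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    answered = [(p, g) for p, g in zip(predictions, gold) if p != abstain]
    correct = sum(p == g for p, g in answered)
    n = len(gold)
    return {
        "accuracy": correct / n,                   # correct over all questions
        "coverage": len(answered) / n,             # fraction actually answered
        "wrong_rate": (len(answered) - correct) / n,
    }


# The example from the conversation, scaled to 10 questions: both models get
# 50% correct, but model B declines to answer 40% of the time.
gold = ["A"] * 10
model_a = ["A"] * 5 + ["B"] * 5
model_b = ["A"] * 5 + ["B"] * 1 + ["IDK"] * 4
report_a = abstention_report(model_a, gold)
report_b = abstention_report(model_b, gold)
```

Under these assumptions both models have the same accuracy, but model B has a much lower wrong-answer rate, which is the difference the speaker says current evaluations do not reflect.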
Original Description
Llama 2 lead and Llama 3 post-training lead Thomas Scialom of Meta/FAIR, on the Chinchilla trap, why Synthetic Data and RLHF work, and how Llama4's focus on Agents will lead us to Open Source AGI.
Chapters:
00:00:00 Introductions
00:04:16 The Llama Origin Story
00:07:34 Are there RLHF scaling laws?
00:09:56 Avoiding the "Chinchilla trap"
00:12:15 Why 405B?
00:14:27 FP8 training and other scaling research
00:17:48 Llama3 vs Llama2
00:18:32 Synthetic data for pre-training
00:21:43 Tool use to generate synthetic data
00:22:40 Pre-training data recipe
00:26:00 Why not MoE?
00:27:05 Why RLHF is so important
00:37:06 How they eval models
00:41:50 Benchmarking Uncertainty
00:44:04 Structured output and tool calling
00:45:52 Llama4 & Agents
00:52:01 Will Meta keep releasing open models?
00:53:55 Why tokenizer vocab size is underrated
00:59:12 AI & Startups
01:03:13 Hiring at Meta AI
Please support our pod by subscribing on https://latent.space/ and Twitter: https://x.com/latentspacepod and spreading the word!