[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models

Latent Space · Advanced ·🧠 Large Language Models ·1y ago

Key Takeaways

This paper club session walks through the Llama 3.1 paper: the scaling laws behind the 405B flagship and the overtrained 8B and 70B models, the pre-training data mix, the training infrastructure, and post-training. The discussion highlights pipeline parallelism, multilingual data, coding evaluations, and the heavy use of synthetic data generation, including a curriculum that moves from single-turn to multi-turn data.

Full Transcript

...and I kicked off the "what do people want to work on" section with "I'm going to do a deep dive into the paper, because I need to make slides on this," and that very much overtook the hackathon. We had a solid crew of 20 or 30 people who were just discussing the paper with me, and the slides didn't really get made. Also, our weird hackathon project winner was just a deep dive into the paper. We also had an in-person paper club session that I led yesterday, and a lot of people from there are trying to join in, so it should be vibes. I'm liking the in-person hybrid format. I might start running those. We'll see how they go, but it was good; everyone had good discussion.

Amazing, amazing. I'd be happy to join that once I head back to SF this Friday.

Oh, Friday. Exciting.

Yeah. So, as you know, we also interviewed Thomas, who was one of the paper's co-authors. He did not give us the paper beforehand, which is annoying, because after reading the paper we have so much better questions than we actually ended up asking in the podcast. But whatever. I think a bunch of us have read it. Vibhu, I feel like you're probably best situated to take over the screen, if you have stuff.

Sure. I have very basic stuff. We also got someone at the hackathon who worked on a project that's paper-to-video, so someone's cooking up a video explainer of this. It's literally doing inference right now. We'll share it once it's ready.

Yeah, I'm excited about it, but I'm also wondering how to do this justice. I feel like we can post questions in here and people can discuss in the Zoom chat. We classically have a lot of side discussions in the Zoom chat anyway, so I'm not worried about that. Vibhu, go ahead, you can get started. But what do people think? What do people want to
talk about? Personally, I called this the synthetic data paper, so I have a lot of insights and questions about the synthetic data stuff, but we can talk about everything. There's just so much in here.

The format that worked well yesterday was: we're not getting through 100 pages. I'll give the high level, we'll go through the tweet overviews, and then we dig into whatever anyone found interesting and dove into. Part of it was that they had a bunch of scaling laws for pre-training, and scaling laws for how they picked 405B and 15 trillion tokens, so whatever someone chose to dive deep into, I was like, okay, we'll dig into that. The other part is that I'm probably going to give a longer, one-hour breakdown at some point where I go through the whole paper. So I started slides. They're very much not ready; it's about 20 minutes of slides that just became discussions.

We'll spend two minutes on the overview. Everyone knows Llama. The interesting stuff: they went from 3 to 3.1, and 3.1 was a pretty big update. The 8B got a lot better, the 70B got a lot better. A lot of this is for the other talk, but they dropped some sizes, and the context has been getting bigger. We thought their scaling was just "overtrain and pray," but no, it's actually pretty grounded in real scaling laws. They're all dense models; after reading the paper, their whole justification was "we want to see stuff that scales." It's the first actual research paper where they talk about everything: hardware, inference, hardware failures, what happened, how they fixed it. So it's a real research paper, and there's a lot in it: pre-training, post-training, their scaling laws, how they run experiments. It's a great recipe for how to build this, and that's what the later talk will be about. They cooked. The model is really good,
it's solid open source, it's GPT-4 level. They talk about how they bring up performance. At some point we'll probably discuss performance, since everyone has their thoughts on benchmarks, so if anyone wants to pop in for a sec, do that, and then we'll go straight to the paper. Also, anyone who finds a time to cut us off, cut us off. It's all vibes. We can see the jumps for the 8B: from 3 to 3.1 it got better all around.

For the overview of the paper, there's a bunch of Twitter threads, so instead of me making slides, we'll go over the main one shared in Discord for everyone who hasn't seen it. Also, is my whole screen sharing, or is it just... okay, let me share my desktop real quick. For people who are new and not in Discord, we have a very active running Llama 3 section. I'll share the paper. You have to find it: go through the news and find the Llama one. There are about 60 links that we've been posting of everything popular on Twitter, so we'll go through these at some point.

The paper overview is basically that they have multiple phases of training. I'm very much not ready, but I do have other notes on this that I'll share, so I started screenshotting stuff. They have three aspects to a good foundation model: data, scale, and complexity. This is why they didn't go into an MoE; they wanted training stability. It's basic two-phase training, pre-training and post-training. They do a lot of scaling-law work, including on how they determine what the pre-training mix is. In pre-training they do most of the training at low context and then continue pre-training at long context. A lot of what they said in their complexity section was, "we want to do basic stuff that will scale up." This is the foundation for how we can redefine scaling laws and train this stuff up. So no crazy complex RL, just SFT for chat tuning, and then
they use a lot of models for rejection sampling to do dataset stuff, plus DPO. A lot of that is in the post-training; that's where they start to see capabilities added. They have little nugget sections, like one on post-training on benchmark-style data. That normally helps small models, but this was the first time someone got to do it at a 400B scale. They post-trained on stuff similar to GSM8K, and the small models had a big improvement; the big one kind of didn't. So it talks about how scale interacts with benchmarks and held-out test sets, and why maybe the big Geminis and such don't do it. They added their safety [ __ ] at the end.

Other interesting stuff that didn't make it to Twitter: they had multimodal experiments. They trained in adapters; they have a vision adapter and an audio adapter. There were cool sections on their pre-training mix. They use a lot of traditional filtering techniques, like RoBERTa-based filters for high quality. They use this for their synthetic data distribution too, as in, how do we extract high-quality data? Then they have a lot of traditional NLP for PII and text extraction. They had a whole section on how they scraped the web and how they built their own HTML parser; they compared it to what's out there, and their stuff is better. There's a lot in this paper.

The data mix was a really interesting section as well. They go into what they did: deduplication, all the stuff you would expect. Model-based filtering was pretty cool; they trained classifiers on Llama 2 outputs. On the synthetic data side, Eugene has a great tweet thread that we'll probably go through at some point. This was an interesting section that we haven't seen before: when you have a base model that you're pre-training and you have 15 trillion tokens, how do you determine what the right mix is? So their
finding was that roughly half the tokens are general knowledge, 25% math and reasoning, 17% code, and so on. This is the first research paper that actually breaks this down. They did scaling-law experiments: they trained small models of a couple billion parameters, tested different data mixes, and then trained a large model to see which data mix actually works. And then they're like, here's the answer.

The model architecture was pretty similar. They made a few little changes: better attention masking, grouped-query attention. All of this is on Twitter, so it's not as interesting. From the podcast that Sean did, the vocab section is pretty interesting. They're like, instead of messing with tokenizers, changing the vocab is pretty big for small models. Check out the podcast, or if it comes up in discussion, we'll discuss it.

Scaling laws were another interesting one for the paper itself. Traditional Chinchilla scaling laws predict what's optimal for your compute budget: the optimal number of model parameters, how many tokens you train on, all that. We thought they were just scaling and praying, trading a fixed-cost training run for cheaper inference. But this is actually grounded. They developed new scaling laws, and the TL;DR is this: previously we did scaling laws where we were just predicting next-token prediction accuracy, so perplexity, how good the next-token prediction is. Instead they do all this fancy math and change the objective to be more representative of a reasoning benchmark. They use the ARC Challenge as the reasoning benchmark, and now, instead of doing scaling laws to predict next-token prediction, they're doing
scaling laws to predict the optimal model setup based on actual reasoning performance. That's where they come up with this: their scaling laws show that for a 402B model you want to train on about 16.5 trillion tokens. Based on that, they did a flagship 405B on 15 trillion tokens. And then this is where the inference-optimal part comes in: for the 8B and the 70B they just reused the 15 trillion tokens and overtrained, and that works.

The other really cool sections that didn't make it onto Twitter were on their training infrastructure. They give out everything, the full pre-training stack. One part is the whole hardware configuration, so 16,000 H100s, what failures they hit, and why they went for simplicity. This was a pretty interesting section: over their 54-day training run they had over 400 job interruptions, 419 of them unexpected, and about 78% of those were GPU hardware issues. Their point was that if they had done all the complex stuff, this would compound, so they wanted something simple and scalable that they could deal with well. This is stuff that you don't really see in papers anymore.

It goes further into the pre-training recipe, with formulas we don't really see anymore: here's the peak learning rate, here's the warm-up, here's the decay, here's how many training steps, here's the batch size. Little nuggets like this haven't really come up on Twitter yet. At first they have a batch size of 4 million tokens with a small sequence length, so the first bit of training is at a sequence length of 4,000. Then, after the first 252 million tokens, they double it to 8 million at a sequence length of 8,000. Later they double the batch size again, after roughly three trillion tokens, and
then they do most of the training at an 8,000-token sequence length. Little stuff like this we still need to digest. There are reasons for why they did this, but basically, TL;DR, no other open-source paper has a recipe like this. And that's kind of what the next 100 pages is. I found all of this really interesting: they talk about batching, GPU utilization, memory utilization, CUDA optimization, their whole training recipe, what they released, performance. But I feel like that's enough of a high-level overview of the paper.

The more fun stuff is, how does it perform? They're all better, and the infra companies are pretty cheap. This is also where everyone else can hop into the discussion. Eugene, Sean, other Eugene, hop in now. Fireworks is somehow really undercutting inference prices. The Scale leaderboard is a held-out leaderboard, and it does pretty well there. What else? Groq has it. Some insider info: for all the infra companies, they gave access to randomized weights of the same size about six days before launch. So six days ago the infra companies started playing around with it, working out how they're going to do inference and what type of decoding they need, but they didn't have the paper or the actual weights. Then on the day of, they released the weights. And Groq is serving at a thousand tokens per second.

What other discussion do we have here? Kyle did pretty good evals on performance. He ran it on his own fine-tuning stack and compared it to 4o mini. OpenAI responded within hours with 4o mini fine-tuning, but fine-tuning the Llama 3.1 8B is kind of on par with 4o mini. The 4o mini fine-tuning is kind of broken and free, but it gets worse. What other fun stuff? There's a comparison of model pricing here that's being updated live.
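A rough way to sanity-check the scaling-law numbers quoted in the overview above (a 402B compute-optimal model at about 16.5 trillion tokens, the 405B flagship at about 15 trillion, and the 8B and 70B reusing the same 15 trillion) is the common rule of thumb that training compute is roughly 6 × parameters × tokens. That approximation is an assumption brought in here, not something stated in the talk, and the function and variable names are hypothetical. A minimal sketch:

```python
# Back-of-the-envelope check of the model sizes and token counts quoted in the talk.
# ASSUMPTION: training FLOPs ~= 6 * parameters * tokens (a common rule of thumb;
# it is not taken from the talk and ignores attention and other overheads).

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs under the 6*N*D rule of thumb."""
    return 6.0 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """How many training tokens each parameter 'sees'."""
    return tokens / params


# (parameters, training tokens) as quoted in the discussion above.
RUNS = {
    "compute-optimal 402B": (402e9, 16.5e12),
    "flagship 405B": (405e9, 15e12),
    "70B": (70e9, 15e12),
    "8B": (8e9, 15e12),
}


def summarize(runs):
    """Return (name, approx FLOPs, tokens per parameter) for each run."""
    return [
        (name, train_flops(n, d), tokens_per_param(n, d))
        for name, (n, d) in runs.items()
    ]


if __name__ == "__main__":
    for name, flops, ratio in summarize(RUNS):
        print(f"{name:22s} {flops:.2e} FLOPs  {ratio:8.1f} tokens/param")
```

Under this approximation the 8B sees about 1,875 tokens per parameter versus about 37 for the 405B, which is the sense in which the smaller models are "overtrained" relative to the compute-optimal point.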
Other tweets: George Hotz and Karpathy tweeted, vLLM supports it, and more independent benchmarks are coming in. Basically, it's good. The other interesting part was the licensing: they changed up their Llama license to proper full open source, everything. We have more infra providers, NVIDIA stuff. But that's kind of where I feel we should open it up. That's the quick 10-to-15-minute overview. I know there was a lot of discussion about synthetic data gen; Sean and Eugene, you had good tweets about this. So this is where we open it up to whatever people found interesting and dig into those topics, because we're not getting through the rest of it. I'm going to open up chat and see what people are up to. Thoughts? Everyone pop in.

Yeah, I want to jump in. By the way, one thing to warn about on pricing is that you're going to see a lot of providers jumping in, and everyone's just trying to get a piece of the pie. With some of the previous model launches, you saw people coming in at lower and lower prices, and then they'd increase later. But I want to jump in on the training side, because I'm quite sure Vibhu and Eugene will have lots to say on the data. So I'll start with that. I can't share the screen, by the way.

Do you want to take over, or do you want me to scroll?

I want to take over, because I want to jump through a few things there. Let me share my screen. All right. I didn't see too much talk about this, but for me one of the big ones is actually pipeline parallelism. Can you see my screen?

Yes.

So if you're looking at this, you're like, what is this crazy schedule that they're doing here? TL;DR: pipeline parallelism is generally the idea of scaling up your training across multiple GPUs, and building around optimizing that has its own
benefits. It also has its own downsides. The major downside, the reason people try to avoid pipeline parallelism at all costs and instead use something like DeepSpeed ZeRO-3, where the weights are sharded across all the other GPUs, is that if you look at pipeline parallelism or model parallelism, there's this problem called the bubble. The bubble is basically this: as your data goes through the different devices, through the forward pass and then the backward pass, you have all this GPU time where some of the GPUs are waiting for other GPUs and doing nothing. They're wasting compute. Because everyone wanted to avoid wasting compute, people went on a search for algorithms to do pipeline parallelism better. One major one is actually from Sea AI Lab, SAIL, coincidentally in Singapore, where they created this crazy algorithm to train without any wasted time. You see the gray spots are the wasted time. And Facebook has now embarked on its own journey on this.

The reason this is exciting even for smaller models is that this kind of algorithmic change on the training side is what's going to allow you to train bigger models more easily on lower-end GPUs. So this concept could apply to, let's say, training a 70B model on 24 GB GPUs and things like that. The reason they probably need it on the 80 GB GPUs is because they're training the 405B. A lot of people in academia treated this as a dead end because of the bubble problem, and then Facebook is like, you know what, we are going to do that. To me that's one of the more exciting things. The other one that I saw some people tweet about is the batch sizing being smaller...

Coming in on that: I thought Google has pipeline parallelism in their JAX distributed training repositories. Don't they?

Yeah, they do, they do. But the
thing is, no offense to Google, everyone just interpreted it as "TPU has too little VRAM," that kind of thing. They had the basic pipeline parallelism, which suffers from the bubble problem. This weird scheduling, which I'm quite sure people are going to start replicating, is there to reduce the bubble.

I also saw lots of papers on this from maybe NVIDIA and Matei Zaharia, from Berkeley or Stanford. They had lots of interleaved pipeline parallelism updates. So you're saying no one is using it, and just Facebook has used it more recently? I find that pretty...

At least no one published that within their training processes, because this is the first major model of this class and size that's saying, hey, we are doing pipeline parallelism.

Google models, though: they have these Pathways distributed training architecture systems, and they publish in maybe OSDI, which is kind of the biggest distributed systems conference. They publish these training systems, and they can do all sorts of parallelism within them, even mixture-of-experts parallelism and stuff like that. So they do quite heavy stuff. I'll look it up and post some papers in the messages if I find them. But my mental model was that people are actually doing this at scale. Thanks.

Yeah. So I'll draw the distinction between pipeline parallelism and techniques like DeepSpeed ZeRO-3, which is essentially where the GPU uses NVLink connectivity to other GPUs to read the model weights. Pipeline parallelism is really more that, instead of going cross-GPU to read the weights of the other half of the model, you just focus on the part of the model that you're working on. This has the trade-off of saving VRAM and allowing you to train a larger model with a larger batch size, but it means you have the bubble problem. I think the focus is really more
about the bubble problem here than anything else. And like I say, I do expect more people to replicate this part. So that's the part I wanted to jump in on.

The other major one I want to jump in on is multilingual. I'm so happy that I've seen this: "we try to avoid using machine-translated data to fine-tune the model." This is something that I think multiple people know I've been shouting from the rooftops about: can we stop using machine-translated data for other languages and then assuming that's great? Because when you speak to native speakers of those languages, they've been saying that it sucks. Finally someone on the bigger-model side is doing that as well, so I'm particularly excited about that. But I think I'll hand off to the whole data stream.

The interesting little point there on translated data is that I've still seen it used. They have a Llama 3 filter that extracts the highest-quality data, the highest-quality reasoning and code data and whatnot. In other work, and this is very traditional pre-training dataset stuff, you need more data augmentation to get more high-quality data, and one option is translation. One thing you can do is train on multiple rounds, more epochs on the high-quality data, so you just resample it. But there was a paper, which I'm forgetting, that tested this: do you want to use the data only once, do you want to train on multiple passes of the same high-quality data, or do you want to do basic augmentation, like translate and translate back? And somehow translating into other languages worked better. That was the best option: translating the high-quality data into another language, as opposed to translating it and translating it back. So there's still some value, but it's an interesting little piece.

Yeah, so I want to hand off to the people who are going to tear all
the data parts into bits, because I just wanted to jump in on training, since that's what I can uniquely offer.

Awesome, appreciate that. I think Cameron has his hand up.

Hey. Did they make any claims around it being good for code generation? I'm interested in it versus Claude.

Yeah. This is a big contrast to Llama 2, where they were intentionally not training for code and then put out Code Llama separately. Now they explicitly outline code as a separate modality, separate from text. Vibhu, I don't know if you have a slide on this stuff. And they also did synthetic data for code as well. They spent a lot more time on coding this time around.

Has anyone looked at it versus Claude 3.5 Sonnet yet?

Yeah, we vibe-checked it.

You did what?

Vibe check. It's not rigorous evals, but we've vibe-checked it, and it does pretty well. In the paper they explicitly mention this as well. They used to have the Code Llama models, and part of the second step of post-training was to add in this section on code, but they explicitly no longer need to do that. I'll pull up the section of the paper, but they mention that this is natively trained in pre-training as well. It's a good code model. They also have the base model and instruct...

Yeah, Jeremy, can you repeat? We can't hear you very well.

Okay. There's a Scale AI benchmark where Sonnet and 4o were compared against the new 405B model, and 405B was found to be basically on par with GPT-4o, which is worse than both Sonnet and GPT-4 Turbo Preview. There's a tweet thread and a Reddit comment that I'll drop. But it outperforms Gemini 1.5. The thing I like about the Scale benchmarks is that they are held out; that is, none of the companies have access to them, and they're private. So there's probably more durability to the
benchmarks, and they don't have as much of a conflict of interest, though they did co-launch with Llama, so there may be a little bit of a conflict of interest.

Thank you, Jeremy. So, overview... go ahead.

The Scale leaderboards aren't just coding. For people who don't know, it started out with GSM8K: they tried to recreate it and made GSM1k, which is meant to match the actual benchmark but be held out. They run models on it and evaluate them. That turned into what they have now: held-out benchmarks, where no one can see the actual examples, for coding, instruction following, math, Spanish; there's a bunch of these. And they're pretty good in the sense that no one can directly train on them. When they put out the first one, there was a piece on the delta for models that do really well on the traditional GSM8K but don't do well on GSM1k, which they haven't seen before. So they basically tried to test who overfit to the benchmarks, and this is trying to solve that.

If we go through real quick, this is where the 405B sits in coding. It's a step right below the GPT-4s and Sonnet; Sonnet is still slightly better. I think they're still testing the 405B, because I'm not seeing it in the rest of them, but they're being updated in tweet threads and whatnot. Jeremy shared a link to the Reddit thread that talks about this, and there's discussion there if anyone's interested. Someone was also talking in...

Thanks very much. I had something to share on the coding evaluation. I can't share my screen... basically, HumanEval is... one second, let me try and share it. Can you see?

Yeah.

So HumanEval is one of the benchmark datasets that people use to benchmark coding, and it's
very simple. They have about 150 questions, and it's almost like autocomplete: solve this simple puzzle in Python, things like that. It's very much one or two lines. And you can see that Llama 405B is not state of the art. Claude Sonnet beats it by a few percentage points. It's close to the GPT and Sonnet models but slightly worse, and I think this confirms the vibe checks.

My understanding on the initial Llama stuff was that Meta didn't focus that much on reasoning or code because they're a social company, so maybe reasoning is not super important for them. But then they had a focused coding data collection effort and shared the big code model, which kind of wasn't that great. Maybe if you don't put the data in from the beginning, just trying to fine-tune on code by itself doesn't work that well.

The other thing I wanted to share, can you see this other page now? Basically, it seems they spent quite a bit of effort to make coding much better in Llama 3. They actually train a code expert and then use that code expert to, I guess, collect high-quality human annotations and do some more post-training. They also did some synthetic data generation to improve coding. So I think they spent quite a bit of effort on reasoning and coding. I didn't read this section carefully, but they have a full section on trying to get better code data, incorporating feedback, and doing analysis. They did quite a bit on coding.

Yeah, there are two sections there. One is the synthetic data gen for coding, and the other is the pre-training mix for code and reasoning, where they trained a second classifier. One of the takeaways there was that when you're pre-processing 15 trillion tokens, you can't just run inference over it. Even Meta, with all the GPUs they have,
couldn't afford to throw Llama 3 inference at this whole 15-trillion-token set. So they trained a code-and-reasoning classifier on DistilRoBERTa, which is a small, older encoder-only Transformer, to annotate their web-scraped data for quality and whatnot. So they have it both in the pre-training set and in the synthetic data gen. There's a really good quote-tweet thread about all this code gen, by Eugene. I'll share my screen and throw him on the stage if he wants to talk about it.

Yeah, thank you. I'm currently commuting, but I'm finding a quiet space right now.

All right, great.

Thank you, Vibhu. I can talk to it. Can you hear me?

We can come to it in a few minutes. If you're commuting, we can come to it in a bit.

No, I'll be commuting for a while, to the stadium for a team event, but I'm finding a good space to sit. Okay. I think what really stood out for me in this paper was how much automation and augmentation there was. In the first one, you can see they actually use Llama to filter out bad data, and this is in the pre-training step. Essentially what they're saying is, hey, we trust Llama 2's judgment well enough to be able to do that.

Next slide. Over here you can see that they actually trust Llama 3 to do intent tagging: they tag the generated data, or the responses, based on intention, and they also classify things based on difficulty. And they adopt some kind of curriculum learning, where they start with single-turn prompts and responses and then move on to multi-turn.

Next slide. And this is the code expert that everyone's been talking about. What it means is that in order
to get Llama good at code, as an intermediate step they had to train a code model. That sounds quite crazy, right? For me, training such large models just seems to take so much effort: curating the data, setting up the infra, and everything. But it seems completely essential in this case; they could not have done it without that. Andrej Karpathy had a great tweet about this: every model distillation and synthetic data generation is really now a stepping stone for the next, better model.

Next, please. And here, this is just an example of how much they trust the synthetic data. I'm focusing only on the green highlights here. The model was prompted to generate problems and then solve each problem. Then they give the model the errors and ask it to fix them. The model also generates the unit tests, which they then use to evaluate the generations. You see that the human is very minimally in the loop.

If you move on, you see this pattern everywhere. One thing that's interesting here is that they use Llama to generate data for target capabilities, and then they back-translate the code into docstrings and comments. That's how they can teach the model to explain code. And then they use those docstrings and comments to create code again.

We're going to go through the rest really quickly. Multilingual, the same pattern. Math and reasoning, the next one, is the same pattern, where the model augments the training data with step-by-step traces. One thing that's really interesting here is that they went the extra step, no pun intended, to train stepwise reward models. That's kind of crazy, no? I
mean, they wanted each step in the chain of thought to be so good that they actually took the extra effort to train stepwise reward models, which they then combined with Monte Carlo tree search to improve the reasoning traces. And then you see synthetic data for long context is the same pattern: Q&A. As you scroll down, you see synthetic data for image captioning and synthetic data for factuality. For factuality, essentially all of it is just synthetic data, if you look at this. I think time will tell whether this really works out well or not; I think we're still too early on the evals. And then you see synthetic data for adversarial examples, and synthetic data for the image encoder training, where they use image captions and then augment an existing dataset with new instructions and responses. And what's really interesting here, in the second-last tweet, is that the human annotators were actually augmented with a model in the loop, right? If you think about it, this represents a slight shift in how some folks are thinking about it. A lot of people are thinking human in the loop, but no, now it's model in the loop, whereby you use the model to create an initial generation that the human can then edit, and it just makes it so easy for the human, right?

The one big takeaway from all of this, and that's all I had from this paper, is: can you imagine how much Meta had to educate and upskill their SDEs or their existing scientists to use this new technology and be trusting of it, and the annotators to trust the new technology and just work based on that? So that was quite eye-opening for me, and I think it sort of suggests, long term, here's where the puck is heading. And that's all I had. Thank you.

Awesome, thank you. Eugene coming in clutch. I threw him on the spot; he's
commuting and already had slides and a tweet thread. But yeah, what other topics have we got? I've got chat.

Did they mention using chain-of-verification prompting, Eugene?

Do you mean chain-of-thought prompting, or chain of verification, where they try to verify the chain?

The latter.

Okay. I don't think they actually did that, but they did mention they had stepwise reward models that check every step in the chain of thought. But I don't recall seeing chain of verification, sorry.

Okay, thank you.

Welcome.

Eugene, early last year there was Tree of Thought, and some iterations with Monte Carlo search and this Tree of Thought stuff. But at that point LLMs weren't good enough to verify, or to provide enough signal for these multi-step reasoning things to happen, with things in the loop. Do you have some idea how they solved it, or why they were able to make all this progress? They basically use all the tricks we were reading about maybe half a year or a year ago, but it seems they actually got them to work. I wonder what made it work.

Yeah, I don't know. I'm very curious about that as well. I wish there were more papers showing how to use Monte Carlo tree search and actually get it to work, and sharing more details about it. I'm afraid I haven't seen too much of that in this current paper.

Is it mostly for coding that they employ this, or for other tasks as well? Because for coding you could signal back some reward, but for other things, I don't know how you evaluate things and propagate information and validate the chains of thought.

Yeah, if I recall correctly, it was actually in the math and reasoning section, whereby they use stepwise reward models to score each step in the chain of thought, so that the final output gets better.

Thank you, I'll look into it more.

Yeah, it's the math one. Lightman et al. is the citation. And I guess
I'll wrap up with one final thing. I'm sorry, it's a bit noisy. I think y'all should listen to swyx's Latent Space podcast with Thomas. He goes really deep, and he has a strong opinion on synthetic data, right? I think listening to that podcast will give you a lot more insight into how Meta is really embracing synthetic data. So I found that podcast quite helpful.

And this was the Karpathy tweet about synthetic data also. Yeah, great podcast. I think that's the one.

Wait a minute, the one I had in mind, in the sense that everything is a step for the next one... no, not this one. It was actually a tweet about smaller models, about how the competition for smaller models is going backwards, but buried in there... if you scroll down a little bit more. This one, yeah, this one exactly. You can see the models have to first get larger before they can get smaller, right? That's the three-line paragraph. And then it's a staircase of improvement, with one model helping to generate training data for the next. It's almost like he had read this paper up front and was alluding to that, I don't know.

Yeah, it's pretty interesting to see. Also, this was a tweet that came out even before. But the small model distillation work is pretty huge, and I think that's a big part of the license play in this too, where they did actually finally change their license to allow people to generate synthetic data and train on outputs of the 405B. I think the 405B is a little overhyped for just using it for inference. When it comes to inference and cost-effectiveness, mixtures of experts are pretty efficient, right? They use less RAM at inference time and they're more cost-effective for just the best quality. But then this is really valuable for synthetic data gen, for filtering, stuff like that, and that's what I see more of. I know Sean, swyx, you also had a
pretty good write-up about this. Sorry, go over that, or anything else about how this is a synthetic data gen model, or any other topics we want to dive into. Also open to everyone else that's in the call too: if anyone had anything interesting that they want to dive into on the paper, pop in, share your thoughts.

Yeah, I think Sain's hand has been raised for quite some time.

Yeah, Sain, go ahead.

Yeah, so this is for Eugene and Vibhu also. We saw Sonnet actually take over some time back, right? And there's still some, what do you call it, gap to cover. So does Sonnet have some other tricks in their bag which are getting them that much higher up?

I know someone's working on a write-up about this.

No, no, no, I abandoned the idea. Yeah, so they never published anything about what tricks they use, but the evidence strongly points to the fact that they use the steering vectors that they had from the Scaling Monosemanticity paper. The main evidence is that they happened to do this monosemanticity research on Sonnet [inaudible]. SJ is constantly screwing up his mic. They did it on Sonnet, and obviously they only shipped 3.5 Sonnet. That's the smoking gun. If they actually had any other trick that caused 3.5 Sonnet to be so good, they probably would have deployed it on Haiku and Opus as well. The fact that they didn't is proof positive that it's basically the monosemanticity stuff. Does that answer your question? Do I need to explain what that is?

I have the opposite take. I think it's not control vectors.

Yeah, why?

They could be. Well, so they did say it's a larger model. If you look at the training data and the training date for when Claude 3 Sonnet came out versus 3.5 Sonnet, it also has a year and a half of significant data updates. So I think that there was a lot of research that was put out on high
quality synthetic data and the post-training mixture of it, and Sonnet probably just had a decent bit of, you know, there's a lot more that they could squeeze out of it. Also, they did say it's bigger. So a lot more data, a lot more research in good-quality synthetic data, the pre-training data mixes started to come out, and it's bigger. I think it was just a lot more post-training as well.

Because, sorry, in post-training, in Sonnet, you can see there are these pause tokens, thinking tokens, where sometimes it generates a token and it kind of does internal thinking and then generates the answer. So it seems like they use some recent tricks, where people say, hey, you kind of need to think step by step, but maybe without materializing the step-by-step thinking directly in the answer. Sometimes when you run Sonnet generations you can see it stops in the middle, and people saw that those are actually pauses for thinking and side thoughts. It seems that really helps with reasoning tasks quite a bit. So that's one additional trick that they use. But yeah, I agree that it's been one year of work, so they probably have lots of tricks in there, not just one, two, three.

Much like in the Llama paper, you'll see that it's hundreds and hundreds of people. Still less than a thousand, but yeah, it's like a Manhattan Project to build one of these things. I actually counted the number of core contributors Llama 3 had, which is good, it's pretty small. It's less than Gemini, which had 950. So, Sebastian says, "What does thinking mean?" Okay, here's where we get philosophical.

My quick take on that is, this used to be a thing with the original ChatGPT web UI, of, like, why is it pausing at stuff? I think some of this is also just the way the API works, right? So what's the inference it's running on, how's the streaming, what's the API service? Sometimes when there's blocks,
it's not that something else is going on, it's just that there's a delayed stream of your API response, and sometimes people overanalyze that. Is that thinking? What's going on? Maybe, maybe not.

Oh, I think in Sonnet's case, some users have already used prompt injection to trick it into, instead of doing the XML thinking block, outputting it in a different format. Then you literally get to see the thinking process.

Yeah, so for what it's worth, I went to ICLR and interviewed the pause token author. It's on the ICLR episode if people want to check it out. I do not think that Claude specifically implemented that version of thinking. I think it's much simpler. It is just chain of thought, just prompted XML chain of thought, that is then subsequently post-processed and removed inside of Claude Artifacts. But it's still a form of thinking, it's a form of chain of thought. It definitely improves the performance.

Right, sorry, yeah, that's what I was aware of. I'm sorry, I missed what Eugene mentioned as an alternative to what you described as a chain of thought that is not presented to the user. Eugene?

Yeah, so instead of creating a custom token, which is the pause token concept, they literally just got the model, through prompting, to reply with a thinking XML block, which you can then trick through prompt engineering to substitute the tokens. Then suddenly this technique becomes very blatant, because it's no longer hidden from the UI. I think also, on a separate line, since a lot of people are looking at the evals, and then they're like, hey, some of these evals are doing worse than, let's say, 4o, or things like that, right? The fact that it's already close by itself means that you're just a few weeks away until
every single benchmark, right, you're going to see a point jump, because someone fine-tuned a code-specific version of the Llama 3 model, or a medical-reasoning-specific version of this model. It's going to go slower than normal, because I spoke to some people in the fine-tuning community, and the biggest hurdle has been, "What do you mean, you need at least three nodes of H100s to start the process?" Yeah, the VRAM requirement is kind of huge. I suspect we are going to see more LoRAs first, before we get full fine-tunes. Also, the most random part: I know Meta did this for good reasons, because they basically did a lot of naughty-word filtering from the sources, but a lot of people in the AI companion space were like, no.

Yeah, makes sense. I'm going to look into that. Thanks.

Yeah, do we have more things on the paper? I mean, there's more to discuss. I feel like everyone's being too polite. There's a lot of new scaling laws they brought up. They had a whole recipe for post-training: how they did it, how they did their SFT. They also released both the base and the instruct models. How much of this was done by synthetic data, how they trained their image and video adapters, all that stuff. For multilingual stuff, they give out a whole recipe on how to do this. It's a long, long read, but for anyone that hasn't read a lot of papers, this is also probably a really good one that's very approachable, very readable, not too crazy technical, to at least understand what's going on. They go into some of their evals on multimodality, how their adapters work, how they perform, and they're like, yeah, it's pretty good. They added speech understanding: how to train a speech encoder, how many hours of recordings they used, how they filtered it. They go through literally all of this, and this is probably where, you know, you could have an hour on this
paper that's like every step of it. But it's an interesting one where, yeah, they do go into all that: the dataset, how they transcribed it. Just little stuff too. In their speech understanding section, there's a part like, "Our ASR training data contains 230,000 hours of manually transcribed speech recordings that span 34 languages." So, you know, just a little one-liner of, we casually manually transcribed 230,000 hours of speech in 34 languages, and we're just training a little adapter for this that we're not releasing. So they put a lot of work into that. And then it goes even deeper into, how do you use this for pre-training? What about spoken dialogue? How do we fine-tune? So what's the recipe for a speech adapter in an LLM? Yeah, we did a lot of preprocessing on the base dataset: manually transcribe a lot of speech recordings, have multiple languages, train it out, here are the speech segment lengths that we want. Then we fine-tune this adapter for spoken dialogue. How do we do that? Well, we synthetically generate responses for prompts, we ask for transcripts, we generate them. They generate 25,000 hours of speech synthesis through Voicebox, which is a whole other series that Meta has put out around how they do voice generation. They have a really good breakdown paper on that, and on how they use that to fine-tune and generate synthetic data for this. So there's a lot in here, if anyone's interested. A lot of that doesn't make it to Twitter, but dig in, present it. But yeah, the architecture stayed the same. I thought the interesting parts were also just that they want to keep it very foundational and see what works and what they can just scale up. The second aspect to that is, it'll be fun to see when they start cooking with actual MoEs, how do we, you know, go outside the box. But high level also,
like, it's nice to have clarity on their scaling laws. I've definitely presented too many times that they just scaled up and prayed, and they took an 8B to 15 trillion tokens, and that they were very inefficient. And then other papers, like Phi-3, took a big shot at this, right? So Phi-3's whole paper is about how we had Chinchilla scaling, then we had this inference-optimal, Llama-like scaling, and then here's how you can do what we think is right. And now Llama puts out a paper and they're like, no, this is actually all based: here's new scaling that's not just next-token prediction, it's grounded on reasoning. Here's how the scaling laws work, here's how you can use them, here's why we did it. And it's like, it's got to make sense.

Yeah, on the scaling part, I found it interesting and funny that they were using ARC as a measurement for the scaling training. One of the wilder takes I had in my head was, so Facebook spent over $100 million on the ARC challenge to try to win the million-dollar prize?

I think it's a different ARC, right? If I'm not mistaken, it's not the same million-dollar one.

Yeah, a lot of good stuff. In the five to seven minutes we have left, I wanted to give some time to, I guess, Hassan, if you want to. Hassan's actually built an app that's kind of cool with Llama 3.1, and maybe there's something to learn about, you know, prompting it, building with it, anything surprising.

Yeah, thanks, Sean. Hey, everybody. I just want to talk about this app that I built real quick. Definitely a lot less technical than what we're talking about right now.

This is dropping all the way down to the...

Yeah, I guess it is all about building. But I just built this little app. It uses a search API: you put in whatever topic you want to learn about, like quantization, and it can explain it to you at kind of any level you want. So let's learn about quantization at an elementary school level.
So it'll basically use a search API, grab all these sou

Original Description

ft. Vibhu Sapra, Eugene Yan, and other friends from the Latent Space Discord paper club! We meet every Wednesday at 12pm PT: https://lu.ma/ls



The Llama 3.1 paper discusses the Llama family of models and various techniques used to improve model performance. The paper covers scaling laws, data augmentation, curriculum learning, and synthetic data generation. By understanding these concepts, developers can improve their LLMs and apply them to various tasks.

Key Takeaways

Read the Llama 3.1 paper
Understand scaling laws for pre-training and post-training
Apply data augmentation and curriculum learning to improve model performance
Use synthetic data generation and model-based filtering to build training data
Fine-tune LLMs for specific tasks
Train multimodal LLMs
Use speech synthesis to generate synthetic training data for speech adapters

💡 The Llama 3.1 paper provides a comprehensive overview of the Llama family of models and various techniques used to improve model performance. By applying these techniques, developers can create more efficient and effective LLMs.
