State of the Art: Training 70B LLMs on 10,000 H100 clusters

Latent Space · Advanced ·🧠 Large Language Models ·2y ago


Josh Albrecht, CTO of Imbue, and Jon Frankle, Chief AI Scientist of Databricks, dish on what it takes to train the largest models on the largest clusters... including fighting Infiniband Porch Pirates

What You'll Learn

The video discusses training 70B LLMs on 10,000 H100 clusters, covering topics such as cluster setup, health checks, and performance debugging, with tools like Megatron, DeepSpeed, and Ken from Uber, and concepts like large-scale LLM training and distributed systems.

Full Transcript

Welcome to the Latent Space podcast, another super special edition. Today we have sort of a two-header: Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome.

Hey, glad to be here. Yeah, thank you for having us.

So both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year, talking about MPT-7B. Remember the days when we trained large models at 7B?

Yeah, back when reproducing LLaMA 1 7B was considered a huge accomplishment for the field. Those are the good old days. I miss that.

So things have accelerated a lot. Actually, let's do a quick catch-up, and Josh, you can chime in as well. So Databricks got acquired... I talked to you at Mosaic. Mosaic got acquired, sorry.

Although sometimes it feels like Mosaic acquired Databricks, because, you know, we're having a lot of fun being here.

Yeah. I mean, you are chief scientist now of Databricks.

Chief AI Scientist. Careful with the title. As much as I would love to understand how Spark works, I'm going to have to defer that to much smarter people than me.

Got it. I don't know what you would highlight so far post-acquisition, but the most recent news is that you guys released DBRX. Is that the thing that most people should be aware of?

Actually, that's no longer the most recent news. Honestly, the most recent news, we announced this but it was at our Data and AI Summit last week, so it was announced among like 100 other things, is that we finally released our text-to-image model, which has been a year in the making through a collaboration directly with Shutterstock. There was a lot of work put into finding a dataset that we were comfortable with working on, and trying to build a model that honestly I felt like I could trust and that others might be able to trust to put out in
the world. So that model was released last week. It's unfortunately just available via API, due to the fact that the data is quite sensitive and quite valuable. It's Shutterstock's entire business in a lot of ways. But I'm still really excited that there's now a model that is trained on a dataset where the provenance of every single image is known, and it's a damn good model, so I'm really proud of the team on that.

Yeah, amazing. Josh, do you have any thoughts on image model questions?

That is not my expertise, but I was excited to see the release of it last week as well, and very happy that you guys did a nice job on the data side of everything there. So that was cool to see.

I think what's unusual is that Shutterstock's doing multiple deals with multiple labs. So what is the Shutterstock model? Is this the house model for Shutterstock? Is this Databricks' version of the Shutterstock model? What is this?

The way that I would think about it is that Shutterstock is doing an amazing business in AI across the board. Their dataset is kind of widely known to be the best stock photo dataset in the world, the most comprehensive, the biggest. When you think about what dataset am I going to train a multimodal model on, you call Shutterstock. And at least I've heard in the news, OpenAI, Google, Meta, Apple have all called Shutterstock and made those deals. So a lot of models have had Shutterstock data incorporated into them, but this is the only model I know of so far where it was exclusively and specifically trained just on the vanilla Shutterstock data. There was nothing else mixed in. We didn't go and scrape the web and find other data or combine datasets or anything like that. And so this is in some sense the house blend. But the other piece is that it's just a dataset where the provenance of every image is known in
public. Where did the data come from? It is the Shutterstock collection. That's it. Nothing less, nothing more. And certainly being at Databricks, if I've learned one thing, it's I've learned about enterprise customers and what they want out of AI. And one of the things they ask for most is just, what can you tell me about the data the model was trained on? And here, especially for text-to-image models, where images are just tricky subject matter and there's been a lot of legal conversation about images especially, it's nice to just have something where I can point to it and say, you want to know where the images came from? These are what they are and this is how they got there.

I will talk a little bit about Databricks because it's relevant to the rest of today's episode. So Databricks... sorry, I keep misspeaking. It's DBRX.

DBRX. Actually, there's been a pronunciation update. It is now DB-Rex. So we have decided to add a dinosaur mascot, because what model doesn't like a mascot? So literally, I wish I could pull it up, there is a little plush dinosaur that we had made. It's like the world's cutest dinosaur, but it is the official mascot of DB-Rex, and there's a little dinosaur logo that you'll probably see around a little bit more, because DBRX is a mouthful, but DB-Rex just kind of rolls off the tongue.

I love mascots. I think every company should have a mascot, and I think Hugging Face got it right. You need an emoji mascot because that's the minimal viable image.

I probably shouldn't talk at all about Velociraptor, but maybe that's something we can talk about later in the summer. I'll just leave it at that.

Okay, that's a hint to names. I feel like your names leak a lot of alpha. So just to quickly cover the headline details: DBRX, DB-Rex, is a mixture-of-experts model that's fairly big, 132 billion total parameters with 36 billion active on any
input, pre-trained on 12 trillion tokens of text and code, and did really well on evals, to the point where you had to dye your hair blue.

That's my high-level conclusion: never make a bet with your team two weeks out from model launch, even when HumanEval is looking quite bad. Because if you set some bar, even if it's arbitrary and you think there's no way in hell they're going to hit it, apparently money doesn't motivate people anymore. Humiliating their boss motivates people. So Josh, you should really take a hint from this. You cannot pay someone enough money to make up for you dyeing your hair blue.

I'll keep that in mind for our next model.

Totally, it works. So speaking of Imbue's next model, perhaps Josh, you want to just say hi to the general Latent Space audience and talk about what we're releasing today?

Yeah, I'm Josh, CTO of Imbue. We're not releasing the model, we're not releasing the weights, but we are releasing a bunch of different things that should make it easier for other people to make their own models. I think right now training foundation models from scratch is a very difficult, time-consuming, expensive, kind of risky endeavor, especially for smaller companies, and the things that we're releasing hopefully make that at least a little bit easier. The things that we're releasing fall into three different buckets. One is infrastructure and scripts for dealing with the hardware, hardware failures, and understanding how well the lowest level of things is actually working, so you can actually do your training at all and at a reasonable speed without having to constantly restart, etc. So infrastructure and training scripts. A second set of things is around the evaluation. After you've trained it, how well is this actually working, and how do you know how well it's working? We're releasing a whole bunch of different data
there: a new benchmark about code reasoning and understanding, as well as our own private versions of 11 different open-source benchmarks, things like BoolQ or ANLI, where we've gone through and cleaned up the data as much as possible by looking at all the ones that models get wrong or that are flagged for ambiguity, and also our own private reproductions of those, where we've done a kind of clean-room, black-box approach: okay, this is what the dataset is supposed to be, here are some examples, let's make our own version of this to make sure that there is no data contamination, to make sure that we're not testing on train. And then a final thing that we're releasing there is around 450,000 human judgments about ambiguity and question quality, which we used in the process of cleaning these evaluations and which we also hope will be helpful for other people training similar models. And then the third thing is CARBS, our cost-aware hyperparameter optimizer, which was especially helpful for being able to experiment at much smaller scales and then scale those experiments up to the much larger scale, kind of on the first try without having to retry. You don't want to be training 10 or 20 different 70B models. You really want to get these larger models right on the first try. And so the ability to tune things very precisely and learn scaling laws, not just for the data and FLOPs but also for learning rate and all the other hyperparameters, and see how you should scale these things up, was extremely valuable to us as we were training the larger models.

Yeah, a lot of stuff.

Yeah, exactly. So there's a bunch of stuff. We'll have to go through all of it.

Yeah, I just want to throw in how excited I am about this. This is the stuff that nobody ever talks about that is the difference between success and failure in this stuff. Can you get your cluster to run? Can you get
software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke, your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. A basic MPI hello world that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting. There are so many levels of things you have to accomplish. This is the kind of stuff that matters. To a point that Josh made earlier, before we got on here: there are plenty of weights out there. Nobody's released this.

Yeah, that was part of the motivation, actually. There are lots of other things that are complementary, but I have not seen nearly as much discussion about some of these other things that we think are pretty important.

I mean, in some sense I'm very excited to have Jonathan on because this is a little bit your bread and butter with Mosaic. I think you've released some part of it with Composer, and I think it's just really interesting to see a different take, basically a full-stack take, that's open-sourced today.

Yeah, it's been an ordeal to figure this out, and every time something changes, whether it's a new GPU or even a new driver update, you get new creative errors and new things go wrong. We've dealt with the weirdest things, from our InfiniBand cables getting stolen from the data center twice, like in boxes before they arrived at the data center. A porch pirate basically had stolen our InfiniBand cables back when those were hard to come by. To, you know, weird
recalls of switches. The strangest stuff has happened. I have my favorite GPU failures I've seen, like ones where the GPU doesn't fail, it has a correctable memory issue, and the memory correction causes the GPU to become a straggler and hold up the whole job. Weird stuff happens, and figuring out how to not just identify all of that but then eventually productize it is in some sense the entire story of Mosaic and now Databricks in terms of our ML offering. Really, the thing we offer is that we have gone through this suffering and figured out how to even productize that. It has been a pain in the butt.

Yeah, it's a lot of work. I think my favorite failure was GPUs just giving wrong math. If they give errors, great, because you can see the errors. But if they just give you the wrong math back, not so fun.

When did they give you wrong math?

Like literally, you could just add two things, for example. The numbers come back, and they're not the numbers that they're supposed to be.

I think it's important to say at this stage, I think it goes without saying for Josh and I, but it's worth saying here: this isn't to say that anything is wrong with this. It's not like NVIDIA did a bad job, or Mellanox did a bad job, or the server builder, the data center operator, the cloud provider, the million other parties that are involved in building this. We are running these insane chips that are huge and complicated and built on tiny transistors at insane frequencies with insane heat, in data centers that for the most part were not built remotely for this kind of power or heat and have been retrofitted for this. Failures happen on a good day with normal CPUs, and this is not a good day and not a normal CPU for the most part. So it's fun to joke about all the weird stuff we see. This is not to say anybody's done anything wrong. This is just part and parcel of working on a massive cluster running
at multiple megawatts of power at a time.

It's crazy. Yeah. So optical, like all sorts, like everything.

I'll take the opportunity to go into the infra piece. There's a description of the infra, just to give people a sense of what we talk about when we talk about massive clusters. I'm just going to read off the blog post here. This post is about one cluster that has 4,092 H100 GPUs spread across 511 computers. They use Unified Fabric Manager nodes, which manage the InfiniBand network, and you talk a little bit about your networking. Is there anything unusual about this setup that you'd call out to people?

Yeah, actually this particular cluster is a little bit non-standard. The vanilla setup for these large clusters, as vanilla as it can be, is what's normally like a 127-node cluster, so closer to 1,024 GPUs instead of 4,000. Here we have a larger cluster. As you start to get into larger clusters, the networking becomes a little more custom. It's a little bit trickier, a little bit more difficult to get these things to all be able to talk to each other at the same speed. And so this particular case is a three-tier network architecture instead of two-tier, which is kind of the normal one. So most of the clusters are a little bit smaller. As you get to even larger scales, this becomes even more complicated, much more expensive. So we chose this particular scale knowing our own workflows and what we wanted to do. This was the right size for us. But yeah, it's not exactly vanilla already. It's already getting into the custom territory.

So my understanding is that... and for what it's worth, I don't know if this is on the record or whatever, but you can just tell me to strike it. Is there any
part of this that comes with the Voltage Park deal that you guys had? Is that part of the hardware that you got from the deal with them?

Yeah, so we worked really closely with Voltage Park to set up all their clusters and infrastructure and everything, and decide even what to order, how the networking should work. We were very involved in the construction and bring-up of this, and that's what this post is about, that process of bringing up all these. There are different clusters in different places at different scales. In this particular post we're talking about this one 4,096-GPU cluster, but there are other clusters that they have as well. And we were very closely involved with figuring out the exact architecture and the trade-offs that go along with picking those exact components. You really don't want to place the wrong order, because it takes months to get it and it's very expensive. So yeah, we're happy to help.

Cables get stolen.

Yeah, exactly. We wanted to make sure that we ended up with compute that would work for us and that would also work for their other customers, and so we helped design something so that we get exactly what we were looking for. We knew that these kinds of details would be super important, and that getting down to the level of the hardware and having these good scripts and everything was going to be a core part of actually getting this to work. I'm very glad that we did that. I don't think that most companies take that full-stack approach, but for us it certainly paid off.

Yeah, it's basically sort of built to spec. It's interesting, that relationship, because usually for the rest of us who don't operate at your scale, we take whatever we can get from cloud providers, but you are basically co-designing from the single machine up.

Mm-hmm.

And you described that a little bit.
You want to take us through the process that you described here?

Yeah, so for the actual blog post and bringing these machines online?

Yeah.

So I think the process, as we have it broken down in the blog post, has a few different layers. First is getting the individual machines to work at all, and then getting the machines to actually be able to talk to each other, so getting the InfiniBand networking to work, and then getting to a point where not just the machines are working and they can talk to each other, but everything is actually working correctly. There's a big gap between "it's working at all" and "it's working perfectly correctly." And then after you have all this stuff working perfectly correctly, nice and healthy, then you get into the software, data, training issues. And then after that you're still not done. Even once you're training at full speed, things are going to fail over time, things are going to change, there are going to be new firmware updates. How do you deal with this change and flux over time without going crazy and pulling your hair out trying to reproduce things or understand why there were regressions? And so there's a lot of work to automate the infrastructure tooling as well. And kind of the first step, bringing these things online in the first place: you have hundreds of machines at this point, so you don't necessarily want to be walking around with a CD-ROM or a USB drive, plugging it in with your keyboard, hitting next, next, next on the OS install. That's not how this works. You do that for one machine, and then we use this thing called Metal as a Service to bring up all the other machines. It's a server that can install the operating system on these other machines. When you're talking about these machines, each machine is on the
order of hundreds of thousands of dollars, so they usually come with a kind of out-of-band management interface as well. So they have their InfiniBand networking, they have their normal 100 gigabit per second Ethernet networking, these are dual redundant, etc., and then you also have this extra out-of-band management network. So you can log in and you can see the boot screen, or you can see the blue screen of death. You can get in there and actually see what was wrong, which is pretty fun, and it makes it possible to automate a lot of this work. So that's the beginning of that, and the blog post goes into much more detail about exactly how we set these up and the other errors that we ran into. When you're bringing these online, you'll definitely have failures. Even if they all worked in the factory, they get shipped, some parts come loose, something fails, something goes wrong. So when you're bringing them online, there'll be some that don't quite work, for all sorts of reasons. As you start to be working with machines at this scale, if something happens one in a thousand times, you're pretty likely to see it. And so you can get pretty rare, weird things, especially since we had fairly early builds and fairly early versions of this hardware. Some of these are some of the first machines that were ever produced, some of the first GPUs. So you get some extra special things there. We definitely worked with Dell, for example, on making fixes at the firmware level, to be like, okay, this thing is wrong, we need to update this in the firmware to actually fix this particular thing. So we worked pretty closely with Dell and NVIDIA.

Yeah, that's what I'm saying. This stuff gets complicated.

And the thing is, taking a step back, the whole reason we're doing this is that we knew this was going to be complicated. There would be these kinds of failures, and if we're just using AWS or some other cloud
provider, these errors are still going to be there, and you're going to have no way to know, no way to debug this, and no way to diagnose what's going wrong. And so we would much rather be able to call up Dell and say, hey, this isn't working, and they're like, yep, okay, cool, let's debug it together. Oh, I see. Yeah, cool, we'll ship a firmware update and actually fix this for you. That was a much better experience than, like, "Great, it just magically fails. I guess we restart and hope that that machine goes away." That's not a very good place to be. So yeah, that's the first place, getting to a place where GPU training is working on your single-node machines. You can observe stuff. We have tons of tooling around Prometheus and all sorts of other tools for understanding what's going on in these machines, because you don't want to be logging into each one and looking at the temperature or something. You really need to have tooling to collect all these metrics, etc. Unfortunately, all of the scripts that we have for this, for this entire cluster and for all this infrastructure, are a little bit special-purpose for our particular thing. So it's not that every script we have, you can just take and plug in. Even if we did open source all the tooling that we have, you'd still have to do a lot of work to open source it. What we are releasing is as many of the things as we can that are going to be useful for other people. You're still going to have to have some way of managing these things, making your own logging aggregators, etc. So that's bringing them up to the point where the single nodes are working. From there it goes into... I'm happy to keep going if you want.

Well, I just want to leave the opportunity for Jonathan to comment, if there's anything that's different from how he runs things.

Oh, I mean, all I'll say is I'll endorse this and say this is hard. This is really
really hard. And I have special props to the folks at Imbue, because they were building this from the ground up. At Databricks and Mosaic, we typically work with cloud providers, because some of this stuff is just... there's too much to handle. It's complicated, there's a lot to deal with, and this doesn't even get into things like physical security, or securing power if you're the data center operator. This gets infinitely complicated, and you have to abstract somewhere. And then you get to the folks who are literally building their own custom chips, and good God. If you're one of those folks, you're having... pour one out for the infra people at some of the AI chip startups, who are having a really, really interesting time right now. But this stuff is really hard, and I don't think we talk about it much because there are so many other things that are hard. But the other hard things, I think everybody's becoming pretty familiar with at this point. This is something that I don't think there's ever really been a comprehensive discussion of, at least not that I've seen.

Yeah. So my impression is that you guys, Mosaic, have your own software for spinning up and down machines, just like Imbue had to build. But Imbue, it sounds like you guys went fuller stack. I don't know how to describe it. Like, Mosaic is not working with Dell on their firmware.

No, no, we're typically working with your cloud provider on their Dell firmware or what have you. I think one of the things, I don't know, Josh, you can correct me on this, it's kind of impossible if you're doing training to not go all the way through the entire stack, regardless of what happens. Somehow I'm still chatting with cloud providers about power contracts, even though the whole point of dealing with a cloud provider is not to
have to think about power contracts. Somehow I'm still asking them about which InfiniBand provider they used this time, to see if this is part of the bad batch of cables I encountered on that cloud provider or what have you. Or we're still talking about a firmware update from pick-your-provider. You can't not do this. It's convenient that they have data center staff who are worrying about what to send back to which provider when, and they have people who can go and wait for the InfiniBand cables so they don't get stolen outside. But it's kind of impossible not to really go full stack if you're thinking about the infrastructure at all. I don't know, Josh, correct me.

No, I think that's right. That's what we expected from the beginning as well, that we would inevitably have to get into the details here, and I'm glad that we just planned for it. I think it made it a lot easier from our perspective to have direct control over this. Instead of having to go to the cloud provider, that goes to the data center, that goes to the supplier, we could just go direct to NVIDIA or Dell or the data center, whoever was responsible, and be like, hey, this thing needs to change. And they're like, okay, yeah, that is our responsibility, great, we can fix that. So it was just a lot easier for us to fix these bugs than if we had to go through an extra layer of email.

Something we discussed in the pre-show was that you had a rule of thumb for your cluster reliability. You say here in the post, by and large you expect around 3% of your machines to break every week. So you're basically going to churn through all your machines in a year.

As it says in the post, that would be true if it was a uniform failure like that, but it's usually these kind of problematic nodes. And to be clear, that is the number that we've heard from other people, that they're having about 3%. I don't think we're experiencing failure rates that are
that high. I think ours is actually quite a bit lower than that, probably because we've taken the time to dig into a large, maybe larger number than we should have, of these failures and get to the root cause of it, and be like, oh, okay, that's exactly what's going wrong. How do we fix this? How do we prevent this from happening? How do we make automated checks for this, so that if it does happen, it just goes back to whoever owns that particular part of the process and they can fix it immediately?

And that's part of what you're also open-sourcing, which is the health checks, right? You've got the NIC health check, GPU health check, disk space health check, Docker, dmesg. I don't know what that is.

That one is just a lot of stuff.

Yeah, that one is one where we realized that when these machines boot, sometimes they wouldn't actually boot cleanly all the way, or when they rebooted, they had problems that they didn't have when they were working before, which was kind of frustrating. Usually if you restart your computer, it gets better. Here you restart, it did not get better, it got worse. That was very frustrating. So this health check looks at every particular line we've ever seen from the boot, in dmesg, every single log line that your computer emits, and says: have we ever seen this before? Is this expected? Is this in the right order, or is there something out of place? If there's anything out of place, then we say, okay, great, now it goes into this longer triage list of, all right, is this acceptable? Should we flag this? Should someone take a look at this? So we're looking down at a very, very granular detail level at what's happening on these computers to make sure that nothing is out of place. And that's critical, because without that, if you're running your training, as Jonathan said, and this thing is slow, what are you supposed to do? You really want to be very
certain that all 4,000 of these GPUs are working like they're supposed to. We know that, and so if it's slow, it's because we messed up the config or something else, and not because of this earlier thing that's really hard to detect in software later.

Yeah. I'm just curious to ask, suppose you were to set up another, let's say, another H100 cluster, and it were at a different data center, and instead of the vendor being Dell it was Supermicro or what have you. How much of this would be repeatable and how much of this would you have to redo? I genuinely don't know.

A decent amount. I think it would go a lot faster the second time. I think there are lots of learnings that we had. And also the blog post, yes, we are releasing the health checks, releasing some scripts, but a lot of the valuable stuff is also in the blog post itself, in the details and the learnings that we've had and the sorts of errors that we ran into. We tried as much as possible to surface those so other people could learn from them and avoid the same mistakes or failures as well. But I think it would go a lot faster, although yes, there would certainly be some things that would be a little bit different. There'd probably be different CPUs or whatever, but I think a lot of that stuff is less variable. I think most of it would apply the second time around, although I'm sure next time we're building one, it'll probably be at a scale that's 10x as big with a different chip or something like this, and then who knows?

Yeah, with ConnectX-8, that will have its own fun behavior and all that good stuff.

Yep.

Perhaps something that people don't discuss, and you don't even talk about this in the blog, but I always wonder: what is the timeline that's reasonable for this amount of work, at least the initial stages? And also, what does the team composition look like for setting up
a cluster, right? What is the mix of skills that you typically would require to get all this going? I can't really speak to typical. One thing I am very proud of is how much we accomplished with such a ridiculously small team. Our infrastructure team fluctuates from week to week, depending on how many things are on fire and how much we need to build, but it's between three and six people. It's small. It's not some huge team of tons and tons of engineers. But those people are very, very good at what they do, and so that has allowed us to get a lot of mileage out of these things. It's not that we're building everything, right? It's not that three to six people build this whole thing. I definitely want to say thanks very much to Dell and H5 and NVIDIA and the other people that have done a lot of the work to bring up this cluster. With 4,000 GPUs and a three-tier networking architecture, you have 12,000 cables, so that's 24,000 things that need to be plugged in. That's just a lot of stuff to plug in, right? And you don't want to mess it up. Each one needs to be done correctly. If it's a little bit loose, it doesn't really work. If you break it, you need to replace it. There's a lot of work that goes into this. And that's if you were to do everything right the first time and didn't have to fix anything. But inevitably you will have to replace something, which means taking all the wires out, pulling the thing out, taking all the GPUs out, going and fixing some cable, putting it all back correctly, putting it back in, and doing this every time. There's a lot of work that goes into it, so there were a lot of people at Dell, NVIDIA, and H5 that all helped a ton with this stuff. I don't know the exact size of the Dell team. It also fluctuated over time. Yeah, excellent. And then
you have all the hardware set up, and now you're firing it up for single node. There's a long description that you guys have about monitoring the MFU, right, and what each situation might be indicative of. One of the most interesting things to me that I saw here is that if training immediately starts off at 60 to 80% MFU, something's wrong. But what are some anecdotes or notable scenarios here that you might call out as maybe counterintuitive or super interesting? There's just so many of them. One of them, which I think is probably pretty common knowledge by this point (which one was this exactly?), I think was the MFU gradually getting worse over time. When we saw that the first time, we were like, what the heck is going on? Why does it get just a little bit worse? This is so strange. Is it getting lazy or tired or something? Is it heat? What's going on? And in this particular case it was memory fragmentation, because you have hundreds of machines, they're doing garbage collection at slightly different times, and then they get slightly further apart and slightly more and more jittered, until eventually they're all happening at kind of random times and just really messing up each one of your steps. So you just turn off garbage collection and call it a day, basically. To be honest, there's other things you can do if you want to be a little bit more sophisticated about it, but you can also just manually have it all garbage collect on some interval. That's what we've done. We just have a garbage collection callback that just runs. I've seen the exact same thing. Yeah, exactly. So I thought that one was kind of fun, and we did trace that one down, and we did find the actual
call. Again, this goes to having good tools. We had really good tools where we could look at a bunch of actual traces in C and be like, okay, cool, this is the thing that's taking a lot of time, or this is the thing that doesn't quite line up here. Oh, I guess it's garbage collection. Okay, cool, interesting. Let's just turn it off. Okay, great, that's what it was, now we can fix it. So basically bugs are not hard if you have good tools, but if you don't have good tools, bugs can be very, very hard. Similarly for heat: another thing that we saw was, oh, the CPU is getting throttled. Well, it's easy to see if you're monitoring the CPU throttling or monitoring the heat. If you're not monitoring that, it's really hard to know why one of them is suddenly going slower. I noticed also in the piece that you mentioned FSDP with ZeRO-3. Actually, I went to ICLR, and Guanhua from the DeepSpeed team was there presenting ZeRO++. I was wondering if you wanted to make any callouts to particular open source or open library or open whatever implementation teams that were super helpful in your process. I think we ended up actually pulling from a whole bunch of different ones to pull things into our own particular pipeline. We use things from NVIDIA's Megatron stuff, we use stuff from probably DeepSpeed. I think we pulled in a bunch of different pieces from a bunch of different places. So it was really nice to see all these working open source examples. I really appreciate all the effort that has gone into actually tuning these things, because you can tune them, but it's a lot of work to tune this stuff and do all the stuff from scratch. It's really nice to have a working example. I think those are probably the two biggest ones, DeepSpeed and Megatron-LM, but there are probably other ones as well. Is there a particular thing in
the ecosystem that you would call out as, there should be something here that is open source, but it's not really, everyone kind of builds it on their own? I want to say something with the file system, because everyone talks about the file system eventually. The file system, actually, I mean, we did something kind of dumb there. We have our own sort of local mirror, like a crappy version of S3 that's local, but it's just a pretty simple script, right? I think we run a little web server that just serves files, and then you can upload them and download them. Okay, great. Part of the reason we did that is that our internet connection in the beginning was not the full-speed one that we would eventually have, and so we were a little bit more bottlenecked in terms of internet bandwidth. I think we looked at a bunch of services out there, like MinIO and some other ones, but a lot of these come with a lot of extra overhead in maintenance, and since we already have so much infrastructure to deal with, we kind of didn't want to bring in a whole other cloud provider, virtualize something. We just wanted something simple, so we went with that, which has been quite helpful. Our tools are usually quite simple. It's Bash and Python and SSH and Docker. We like to keep things simple so it's easier to debug. Fewer layers of abstraction make it a lot easier to work with. We don't use Kubernetes, for example. We just directly launch these things, and it's been much easier to debug this way. One tool actually that does come to mind that I will call out is Kraken from Uber. That was great. We love that tool. We were a little bit skeptical. I'm sorry? Yeah, so Kraken is a distributed Docker registry, basically, that uses BitTorrent to transfer things between the machines
in a sort of nice, optimal way. In the very beginning, the naive way is that you have this one Docker registry, which was outside of the cluster, so every time we change an image, there's many gigabytes that each of the 500 machines needs to download, and that just takes a really long time. What this thing does is just one of them downloads it, and then they all sort of broadcast all the pieces to each other. It was just a really nice, fast way of getting these images down, and it was very robust. There's a lot going on under the hood, but I think it's a pretty cool tool, and we haven't really had any bugs with it at all. Amazing. Yeah, I mean, that's all my questions, I guess, for the infra piece. I don't know if, Jon, you had something that you're burning to ask. All I can say is just: same, in a lot of senses. Been there, done that, seen this, plus one. I think the one big difference, perhaps, in philosophies is we've tried to basically standardize on as much commodity stuff as possible. I think the reason I asked about trying to do this on multiple different pieces of infrastructure is that we're running on like six or seven different clouds right now, and everybody has done something slightly different, and my gosh, the little differences add up, as you've seen. So our philosophy has been: whatever the hell we can standardize, please let's standardize it. Like vanilla off-the-shelf FSDP. We wrote our own data loader, but we've tried to make that as much of a standard as we can across our infrastructure and in Databricks, because things just start getting really complicated. Or we use Kubernetes extensively, because it at least gives us a uniform set of APIs. That's our hardware abstraction layer, to a certain extent, for everything else. So it's just a difference in philosophy there, but otherwise, yeah, this stuff
is really, really hard. And I feel like we take for granted how much of this is done for us when you go and you just query ChatGPT, for example. Oh my God, everything going on underneath that. It's kind of a miracle that the machines boot up, let alone that you can query a giant language model that's probably doing inference across multiple machines and was trained across thousands of machines. Minor miracle. Yeah, it's an awesome amount of power that we invoke with a single API call that we take for granted these days. It's absurd. That point about Kubernetes: I will say, as a former AWS employee, it seems like it would be ideal for Imbue to at some point make it more abstracted or agnostic, because you're going to want to replicate your setup. We do have our own sort of replacement. It's just a much simpler version of Kubernetes. Kubernetes is really designed for running services, not for running experiments. That's its main architecture. And so for us, we have everything that's like, cool, you're going to run an experiment, so you want it to run to completion, right? Okay, great. The primitives are built around a slightly different style, and that makes it a lot simpler to fit the nature of these machines: they are going to disappear, they will need to be rebooted for infrastructure upgrades, something will happen to the GPUs. Failure is baked into this as a core part of our infrastructure. So it's not that we don't have an abstraction. It's that it's a simpler, more tailored abstraction for the particular work that we're doing. Yeah, I think it all depends on what your goals are. I think the challenge in a lot of the deep learning stuff right now is that people often build things that are more complicated than necessary to get the job done, and
complication is the enemy of everything. Don't use a fancier parallelism strategy than you have to, don't use a fancier set of libraries than you have to, don't do anything that you don't have to do, because it's hard enough as it is. Don't overcomplicate your own life. Don't try to bring in more tools or more fancy architecture tweaks if you absolutely don't have to. It's about getting to the minimum amount necessary to get the job done, and it's really tempting to want to try to use everything, so I totally understand that one. I think the last piece I'll maybe call out, and I'm just going to weave this in because I see the opportunity to do it: are there any infrastructure shifts that need to arise because of changing architecture? For example, at Imbue you're announcing a dense model, the 70B dense model, whereas Jon just worked on DBRX and the text-to-image model, which presumably have different bottlenecks. That's correct. For us, we train both dense and mixture-of-experts models. The one we happened to get permission to open source was a mixture-of-experts model, and those models are very demanding when it comes to network bandwidth, at least if you're training them in FSDP ZeRO-3 style, where there's just a lot of parameters getting shuffled back and forth. Your ratio of compute to amount of data that you have to shuffle back and forth becomes a lot worse, because you're only using a fraction of the parameters for every token instead of all the parameters. So we had to really push the envelope on getting all the stuff to the right places on time. Actually, the networking part of DBRX was the single hardest thing, I think, of the entire process, just getting MoE training working at scale across a big cluster. We still managed to, I think, do it all with commodity parts, which was very
exciting. We were using FSDP, and we eventually used HSDP. HSDP is a version of FSDP where you have multiple smaller replicas and you're doing data parallel within those replicas, and that helped a lot with network latency issues that we were running into, just because we were transmitting so much data for every single part of the process. I think it was actually instructive for how Google designs their hardware and software together, personally. Their training, as far as I understand, is using kind of a ZeRO-3 style of training and has been for a while. They also train mixture-of-experts models. TPUs have a very different network bandwidth to compute ratio. They have a lot more bandwidth, just objectively, and TPUs per chip tend to be a little bit less compute intensive and have a little bit less memory. It's just a different design choice. So the ratio of FLOPs to bandwidth is very different, and that means that it's much easier for Google to be able to pull off some of this stuff. They also have an interesting torus-style network architecture, literal network architecture, not the model but the network. Is this the sort of block attention? I forgot what you call it. This is not the ring attention, but these are the ring all-reduces. You have three different dimensions of rings, because they kind of put together these three-dimensional toruses, from what I understand. So Google's infrastructure in some sense is kind of, I wouldn't say built for this, but maybe the way that Google trains models is built for a slightly different bit of infrastructure they have. And it's kind of neat to think about that. One thing that I think NVIDIA announced for both the GH200 and the GB200 is this hybrid networking, where you'll have blocks of NVLink-networked chips. I think for the GB200, I
think it's groups of 72 GPUs that will all have NVLink to each other, so higher bandwidth. Then you'll have normal networking of some kind, InfiniBand or RoCE or what have you, between these blocks. It's a change due to the fact that it's hard to build really high bandwidth networks over very large groups, but it is now a blocked networking, and you have to think about how you architect your model and your parallelism differently. You also have to think about fault tolerance differently, because it now matters where you lose a GPU, whereas it didn't before. So it's all really interesting and really fun, speaking personally, but it's going to mean new nightmares when we all move to that generation and have to think about new versions of these problems. As you go up to larger scales, it gets quite different. Right now, let's say for example you experience a GPU failure every day. That's fine, just restart. If you make your thing 24 times as big, now it's once an hour. Now it stops being quite as easy to just restart, right? So now you have to kind of bake in this sort of redundancy that you didn't have before. I think as you go up in scale, you end up running into a lot of really interesting problems that also inform the actual design. Yeah, I mean, as an orchestration guy, this is why I always emphasize very cheap storage or very fast storage, so you can checkpoint more, but that's probably not the best solution for fast training. Which works fine when you're doing language, and then you move to vision or video, and then you have multi-petabyte data sets, and getting cheap, fast, multi-petabyte storage starts to bite. I've certainly encountered issues where the literal data center where my GPUs were did not have enough object store to fit the data sets that people wanted to
bring into that data center from whichever users were

This video teaches how to train 70B LLMs on 10,000 H100 clusters, covering cluster setup, health checks, and performance debugging, with a focus on large-scale LLM training and distributed systems. The speakers share their expertise and experiences, providing valuable insights and practical advice.
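
The MFU (model FLOPs utilization) figure that the speakers monitor can be estimated from training throughput. The following is a minimal sketch, not the speakers' own tooling. It assumes the common approximation of roughly 6 FLOPs per parameter per token for the forward and backward pass of a dense transformer (attention FLOPs ignored). The function name, the throughput number, the GPU count, and the per-GPU peak are all illustrative assumptions.

```python
def estimate_mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Return model FLOPs utilization as a fraction of theoretical peak.

    Assumes ~6 FLOPs per parameter per token (dense model, forward plus
    backward pass), which is an approximation rather than an exact count.
    """
    if n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("n_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    mfu = estimate_mfu(
        n_params=70e9,              # 70B-parameter dense model
        tokens_per_second=1.0e6,    # made-up cluster-wide throughput
        n_gpus=4096,                # assumed GPU count, for illustration only
        peak_flops_per_gpu=989e12,  # assumed dense BF16 peak for one H100
    )
    print(f"Estimated MFU: {mfu:.1%}")
```

Logging this value every few steps makes the patterns from the discussion visible, such as a slow downward drift over time or a number that looks too good at startup.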

Key Takeaways

Configure networking for large clusters
Design hardware and infrastructure from single machine up
Bring up individual machines and network them together
Monitor MFU and identify performance bottlenecks
Run a simple local file mirror to work around limited internet bandwidth
Use Kraken from Uber as a distributed Docker registry
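
One of the MFU bottlenecks described in the transcript is garbage collection running at slightly different times on each machine. The fix mentioned there is to turn off automatic collection and collect on a fixed interval. A minimal sketch of that idea using Python's standard `gc` module follows; the class name `PeriodicGC`, its method names, and the interval value are hypothetical, and this is not the speakers' actual callback.

```python
import gc


class PeriodicGC:
    """Disable automatic garbage collection and collect on a fixed step interval.

    Hypothetical helper: every training process creates one and calls
    on_step_end(step) with the same global step number, so all processes
    pause for collection at the same step instead of at random times.
    """

    def __init__(self, interval=1000):
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.interval = interval
        gc.disable()  # stop the interpreter from collecting at arbitrary moments

    def on_step_end(self, step):
        """Run a full collection when step is a multiple of the interval."""
        if step % self.interval == 0:
            gc.collect()
            return True
        return False

    def close(self):
        """Restore automatic garbage collection."""
        gc.enable()


if __name__ == "__main__":
    collector = PeriodicGC(interval=3)
    for step in range(1, 7):
        # One training step would run here.
        if collector.on_step_end(step):
            print(f"collected garbage at step {step}")
    collector.close()
```

Keying the collection to the global step, rather than to wall-clock time, is what keeps the pauses aligned across machines in this sketch.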

💡 Training large language models at scale requires careful planning, expertise, and the right tools, and can be achieved with commodity hardware and custom software, but also demands attention to detail and a deep understanding of the underlying systems and technologies.
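
The scaling example from the transcript (one GPU failure a day becomes roughly one an hour on a cluster 24 times as large) can be written as a one-line calculation. This sketch assumes failures are independent and that the per-GPU failure rate stays constant, so the expected time between failures shrinks linearly with cluster size. The function name is hypothetical.

```python
def hours_between_failures(baseline_hours, scale_factor):
    """Expected hours between failures after growing the cluster.

    Assumes independent failures at a constant per-GPU rate, so a cluster
    scale_factor times larger fails scale_factor times as often.
    """
    if baseline_hours <= 0 or scale_factor <= 0:
        raise ValueError("baseline_hours and scale_factor must be positive")
    return baseline_hours / scale_factor


if __name__ == "__main__":
    # One failure per day (24 hours) on the current cluster, scaled up 24x.
    print(hours_between_failures(24.0, 24.0))  # prints 1.0
```

At that failure rate a restart-from-checkpoint strategy spends a growing share of each hour recovering, which is why the speakers talk about baking in redundancy at larger scales.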
