Ep 18: Petaflops to the People — with George Hotz of tinycorp

Latent Space · Beginner ·🧠 Large Language Models ·3y ago

Key Takeaways

George Hotz discusses tinygrad, an ML framework built around a small, RISC-like set of operations, and its potential to take on Nvidia, Google, and PyTorch. The focus is on making ML compute accessible to everyone, including support for AMD GPUs and the development of a personal data center called the tinybox.

Full Transcript

Hey everyone, welcome to the Latent Space podcast. This is swyx, writer and editor of Latent Space, and Alessio is taking over with the intros. Alessio is partner and CTO in residence at Decibel Partners.

Hey everyone, today we have geohot on the podcast, a.k.a. George Hotz, for the human name. Everybody knows George, so I'm not going to do a big intro. A couple of things that people might have missed: you were the first to unlock the iPhone. You traded the first ever unlocked iPhone for a Nissan 350Z and three new iPhones. You were then one of the first people to break into the PS3 to run arbitrary code. You got sued by Sony, and you wrote a rap song to fight against that, which is still live on YouTube and which we're going to have in the show notes. Then you did not go to Tesla to build vision, and instead you started comma.ai, which was an amazing engineering feat in itself, until you got a cease and desist from the government to not put these things on the street, and turned that into a research-only project.

You know they're out there.

Yeah, yeah, no, they're out there, but you market them as a research kind of...

No, we use the word dev kit. That's not about the government. It has nothing to do with the government. We offer a great one-year warranty. The truth about that is it's gatekeeping. What's the difference between a dev kit and not a dev kit? Nothing. It's just the question of, do you think it's for you? And if you think it's for you, buy it. It's a consumer product. We call it a dev kit. If you have a problem with that, it's not for you.

Good, that's a great insight. And then I was going through your blog posts to get to today. You wrote this post about the hero's journey, and you linked this thing called the portal story, which is kind of the set of stories, movies, and books about people living this ordinary life, and then they run into this magic portal that takes them into a new, very exciting life and dimension.
When you wrote that post, you talked about tinygrad, which is one of the projects you're working on today, and you mentioned this is more of a hobby, something that is not going to change the course of history. Obviously you're now going full speed into it, so we would like to learn more about what was the portal that you ran into to get here.

Well, you know what made me realize that I absolutely had to do the company? Seeing Sam Altman go in front of Congress. Why? What are the odds that they nationalize Nvidia? What are the odds that large organizations and the government, but of course I repeat myself, decide to try to clamp down on accessibility of ML compute? I want to make sure that can't happen structurally. So that's why I realized that it's really important that I do this. And actually, from a more practical perspective, I'm working with Nvidia and Qualcomm to buy chips. Nvidia has the best training chips, Qualcomm has the best inference chips. Working with these companies is really difficult, so I'd like to start another organization that eventually, in the limit, either works with people to make chips or makes chips itself, and makes them available to anybody.

You shared a few core insights; maybe we can dive into each of them. So XLA and PrimTorch, those are the complex instruction system, and tinygrad is the restricted instruction system. So you're kind of focused on, again, tinygrad being small, not being overcomplicated, and trying to get as close to the DSP as possible, in a way where it's more...

Well, it's a very clear analogy from how processors developed. A lot of processors back in the day were CISC, complex instruction set: System/360, and then x86. But this isn't how things stayed. Now the most common processor is ARM, and people are excited about RISC-V. RISC-V is even less complex than ARM. No one is excited about CISC processors anymore; they're excited
about RISC, reduced instruction set, processors. So with tinygrad, we're going to make a RISC op set for all ML models. And yeah, it can run all ML models with basically 25 ops instead of the 250 of XLA or PrimTorch, so about 10x less complex.

You talked a lot about existing AI chips. You said if you can't write a fast ML framework for GPUs, you just cannot write one for your own chip. So that's another one of your core insights. I don't know if you want to expand on that.

Yeah, I mean, your chip is worse, right? There's no way the chip that you're going to tape out, especially on the first try, is going to be easier to use than an AMD GPU. And yet there's no good stack for AMD GPUs, so why do you think you can make one for your chip? You can't, right? There's one other company aside from Nvidia who's succeeded at all at making training chips. What company?

Who's trained a model on AMD or Intel? Nobody.

Cerebras?

I'm talking about... you might know some startups who trained models on these chips. I'm surprised no one immediately gets this, because there is one other chip aside from Nvidia that normal people have actually used for training.

Neural Engine?

No, no training, no.

You can only buy them in the cloud.

Exactly, yes. Midjourney is training on TPUs, right? A lot of startups do actually train on TPUs, and they're the only other successful training chip aside from Nvidia. But what's unique about Google is that they also wrote their own ML framework, right? And if you can't write your own ML framework that is performant on Nvidia, there's no way you're going to make it performant on your own chip.

And they started from TensorFlow, and then they made the chip after.

Yeah, exactly. And you have to do it in that direction, otherwise you're going to end up, you know, Cerebras, one of those things, a million dollars. I've never seen a Cerebras. No one's ever like, oh, I trained my model on a
Cerebras. Most people are like, I trained my model on GPUs. Some people, 20%, are like, I trained my model on TPUs.

And then the third one, which is the one that surprised me the most, is that Turing completeness is harmful and should be avoided. It made sense once I read it, but maybe tell us a bit more about how you got there.

Okay, so CPUs devote tons of their silicon and power to things like reorder buffers and speculative execution and branch predictors. And the reason that you need all these things is because at compile time you can't understand how the code's going to run, right? This is Rice's theorem. This is the halting problem in its limit. And this is not like, oh, the halting problem is theoretical. No, no, it's actually very real. Does this branch get taken or not? Well, it depends on x. Where does x come from? Forget it, right? But no branches depend on x in a neural net. Every branch is a static loop. If you're doing a matrix multiply, it's a static loop over the inner dimension. And neural networks are even better: no loads even depend on x. With a GPU shader, your load might depend on which texture you're actually loading into RAM. But with a neural network, your load is, well, I load that weight. Why? Well, because I loaded that weight the other million times I ran the same net. Every single time you run the net, you do the exact same set of loads, stores, and arithmetic. The only thing that changes is the data. And this gives you a very powerful ability to optimize that you can't do with CPU-style things, which have branches, and even GPU-style things, which have loads and stores. Well, GPUs, if you want GPU-style stuff, you have, like, load based on x. You now need a cache hierarchy, and not an explicit cache hierarchy, an implicit cache hierarchy with eviction policies that are hard-coded into the CPU. You start doing all this stuff and you're never going to get theoretically good performance. Again, I don't think there's
100x. You know, some startups will talk about 100x, and they'll talk about absolutely ridiculous things like clockless computing or analog computing. Okay, here: analog computing just won't work. And clockless computing, sure, it might work in theory, but your EDA tools are... maybe AIs will be able to design clockless chips, but not humans. But what actually is practical is changing cache hierarchies and removing branch predictors and removing warp schedulers, right? GPUs spend tons of power on warp scheduling because we have to hide the latency from the memory. You can bypass the latency if everything is statically scheduled.

Why do you think people are still hanging on to Turing completeness?

Well, because it's really easy. Turing completeness is just really easy, right? It's really easy to just, you know, it'd be so nice if I could do, like, an if statement here and actually branch the code. So it requires a lot more thought to do it without Turing completeness.

And would this be qualitatively different than TPUs?

So TPUs are a lot closer. Yeah, TPUs are a lot closer to what I'm talking about than, like, CUDA. Okay, so what is CUDA? Well, CUDA is a C-like language, which compiles to an LLVM-like IR, which compiles to PTX, which compiles to SASS, which are all Turing complete. TPUs are much more like this. Their memory is pretty statically managed. I did some reverse engineering on the TPU; it's published in tinygrad. It has, like, a VLIW instruction, and it runs them. So it's similar. I think the TPUs have a few problems. I think systolic arrays are the wrong choice. I think they have systolic arrays because that was the guy's PhD, and of course Amazon makes...

Could you summarize systolic arrays?

Systolic arrays are just... okay, so basically, this is a way to do matrix multiplication. Think of a grid of mulaccs, and then the grid can multiply and then shift, multiply then shift, multiply then shift. And they are very power efficient, but it becomes hard to schedule a lot of stuff on
them if you're not doing perfectly sized dense matrix multiplies. You can argue, well, design your models to use perfectly sized dense matrix multiplies. Sure, but...

No, but thanks for indulging on these explanations. I think we need to keep our audience along with us by pausing every now and then to explain key terms.

You know, you say systolic array and I just immediately get a picture in my head of tilting a matrix and shifting it. It's hard to kind of explain.

Yeah, we'll do something in the video: you use your hands and we edit in visuals.

Yeah, there's some great graphics that just show you, oh, so that's what a systolic array is. But it's a shift machine that looks kind of different from the typical ALU sort of machine. I think the right answer is something that looks more like queues that feed into ALUs, and then you can prefetch the loads from the memory, put them in a bunch of queues, and then the queue just feeds into another queue over here. But that's not even the main problem with TPUs. The main problem with TPUs is that they're closed source. Not only is the chip closed source, but, well, all of XLA is open source, but the XLA-to-TPU compiler is a 32 megabyte binary blob called libtpu on Google's cloud instances. It's all closed source. It's all hidden stuff. And, you know, well, there's a reason Google made it closed source. Amazon made a clone of the TPU. It's called Inferentia, or they have some other name for the training one. Yeah, and look, it's a clone of the TPU. Its software doesn't work, though. Google's software at least kind of works.

So those are kind of the three core pieces. The first thing you've been working on is tinygrad, and on one of your Twitch streams you said it is the best thing you've ever written. Tell us a bit more about that creation.

For a long time, tinygrad had a hard limit at a thousand
lines of code, and what this would force you to do is really make sure you were not wasting lines. I got rid of the restriction because it became a little code-golfy at the end. But once the core framework of tinygrad was there in those thousand lines... it's not huge now, it's like 2,800 lines now, and it's still very readable. But the core framework, the ideas, are expressed with no boilerplate. If you go read PyTorch, you know, PyTorch is actually pretty good code. I think Facebook's pretty good. But there's so much boilerplate. Go in PyTorch and try to track down how a ReLU actually works.

Just a lot of distractions.

Oh, you're going to be diving down a long stack, from Python to C to custom libraries to dispatchers to... and then I don't even know how to read TensorFlow. I don't even know where the ReLU is in TensorFlow. Nobody knows. Someone at Google knows, maybe. Google as an organism knows. I don't know if any one individual at Google knows.

What are the important ergonomics for a developer as you think about designing the tinygrad API?

So the tinygrad front end looks very similar to PyTorch. There's an even higher-level front end you can use for tinygrad, which is just ONNX. We have better support for ONNX than Core ML does, and I think we're going to pass ONNX Runtime soon too. People think ONNX Runtime is the gold standard for ONNX. No, you can do better.

Pass them in what, specifically?

Tests, compliance tests. So ONNX has a big set of compliance tests that you can check out, and we have them running in tinygrad, and there are some failures. We're below ONNX Runtime, but we're beyond Core ML. So that's where we are in ONNX support now. But we will pass ONNX Runtime soon, because it becomes very easy to add ops. You don't need to do anything at the lower levels; you just do it at this very high level, and tinygrad
compiles it to something that's fast using these minimal ops. I mean, most concretely, what tinygrad can do that PyTorch can't really do is, if you have something like a times b plus c, right? If you write that in naive PyTorch, what it's going to do on the GPU is, well, read a, read b in a kernel, and then store a times b in memory, and then launch another kernel to do a times b plus c. Okay, got to do those loads from memory. And now I did a whole extra round trip to memory that I just didn't have to do. And you're like, yeah, but you can use the torch JIT and it corrects this. Yeah, for that one example, for that one example of mulacc. But oh, now you did three multiplies? Six multiplies? It won't compile arbitrary code.

And have you looked into the other approaches, like PyTorch Lightning, to accelerate PyTorch itself?

Well, PyTorch Lightning, my understanding is it's mostly a framework around PyTorch, right? PyTorch Lightning is not going to fix this fundamental problem of, I multiply six tensors together, why is it going to memory any more than a single read from each and a single write to the output? There are lower-level things in PyTorch. I'm not exactly sure what Dynamo does, but I know they're generating some Triton stuff, which is going to generate the kernels on the fly. But PyTorch Lightning is at a higher level of abstraction. So tinygrad's front-end stuff looks like PyTorch. I made a few tweaks. There are a few things I don't like about PyTorch. Why is ReLU a class? No, really, what's the state? You make a class, and there's state? Everything should just be torch functional, and then ReLU should just be .relu() on the tensor. Also, there are things in torch where you have to do torch dot and not tensor dot, right? Why are these things like this? It just shows an API that's not perfectly refined. But when you're doing stuff tinygrad style, where you
don't have lines, well, it has to work this way, because even the lines to express the... well, you can't use the where operator. And the where operator in PyTorch, why is it true case, condition, false case? Ugh, that's the worst. That's how Python expresses ifs. It's disgusting, right? Ternary operators are much nicer. It should be, I can do, like, (a < 0).where(a, 1), right?

A very pandas-like API.

Yeah, it looks like torch, NumPy, pandas. They're all very similar. I tried to take the cleanest subset of them and express them. But like I said, you can also interact with it using ONNX. I have a rewrite of Stable Diffusion, a rewrite of LLaMA, a rewrite of Whisper. You can look at them. They're shorter than the torch versions, and I think they're cleaner.

And you streamed them all?

Yeah.

Very nice. Laziness is kind of the other important concept that you're leveraging to do operation fusing. Talk a bit more about that.

So yeah, you have basically a few different models for compute. The simplest one is eager. Eager is, as soon as the interpreter sees a times b, it actually dispatches a times b. Then you have graph, like TensorFlow, which will put a times b into a graph and then will do absolutely nothing until you actually compile the graph at the end. I like the third choice, which is somewhere in the middle: laziness. Laziness is, you don't know when the ops are going to dispatch, and don't worry about that. You don't have to worry about this as a programmer. You just write out all your stuff, and then when you actually type .numpy(), it'll be ready by the time you copy the thing back to CPU. Or you can do .realize(), and it will actually force that tensor to be allocated in RAM. But yeah, if you think about it, PyTorch is kind of lazy in a way, but they didn't extend the paradigm far enough, right? When I do a times
b in PyTorch, it's going to launch a CUDA kernel to do a times b, but it's not going to wait for that CUDA kernel to complete. So you're getting the worst possible world: you're getting the same laziness, but you also can't get fusion, because PyTorch doesn't know that I'm then going to do plus c. There's no way for it to be like, whoa, whoa, don't launch that CUDA kernel, let's do this one too. Again, PyTorch is working on this stuff, and, you know, it's a little bit harder. At comma, I felt like I was competing against a lot of idiots. Here I'm competing against smart, very smart people who've made, I think, some different trade-offs, right? Whereas if you're trying to build something that is just straight up good on Nvidia, and you have a lot of people and complexity to throw at it, yeah, PyTorch made a lot of the right choices. I'm trying to build something that manages complexity. You can always make your software do more. The magic is when you can make your software do more without adding complexity, because complex things eventually collapse under their own weight. TensorFlow actually collapsed under its own weight. That's kind of what happened, right?

How does fusing actually work?

So yeah, there's this thing called lazy.py. When you do, like, a times b, that's put into a graph, but it's a very local graph. There are no global graph optimizations. And even this can change, right? Again, the programming model for tinygrad does not preclude eagerness. Laziness is not guaranteed laziness; it's just going to try its best. So you put in a times b. That's a binary op, right? And then a times b, that's a node in the graph. It's a virtual node because it's not realized yet. Plus c: okay, here's a new node, which takes the c tensor
in here and takes the output of a times b. It's like, whoa, wait, there's two binary ops. Okay, we'll just fuse those together. Okay, here I have a kernel. This kernel has a, b, and c as inputs. It does a times b plus c in the local registers and then outputs that to memory. And you can GRAPH=1 in tinygrad. Another amazing thing that tinygrad has that I've not seen in any other framework is two things. GRAPH=1, which is the environment variable: it will output a complete graph of all the operations. Other people are like, oh, you can use PyTorch, export it to ONNX, and use Netron. Yeah, you can, but what if that's not what's real, right? GRAPH=1 will show you the actual kernels that were dispatched to the GPU. You can also type DEBUG=2, which will print those kernels out in your command line, and it will tell you the exact number of FLOPs and the exact number of memory accesses in each kernel. So you can immediately see, wait a second, okay, this kernel did this many FLOPs, this was the gigaflops, this is how many bytes it read, and this is the gigabytes per second. And then you can profile without having to... okay, I mean, in theory, in PyTorch, sure, use the Nvidia Nsight profiler. No one does, of course, because it's so difficult, right? Nvidia used to... I think CUDA 9 was the last one that had it, they had a command-line one. But now it's like, okay, I'm going to generate this blob, use this Nvidia GUI tool to convert it into a Chrome trace, and then load it. Yeah, no one does it, right? I'll just type DEBUG=2 in any tinygrad model, and it will show you all the kernels that it launches and the efficiency of each kernel, basically.

Yeah, this is something that John Carmack has often commented about: when you code, you need to build your instrumentation or observability right into it. I wonder if, whatever John is working on, he's adopting this style, and maybe we can sort of encourage it by, like, I don't
know, naming it and coining it as a certain kind of debugging style.

If he would like to start contributing to tinygrad, I'd be...

You should hook up with him.

I've chatted with him a few times. I'm not really sure what his company's doing. But no, I mean, hopefully we get tinygrad to a point where people actually want to start using it. So tinygrad right now is uncompetitive on Nvidia, and it's uncompetitive on x86.

And specifically, what do you care about when you say uncompetitive?

Oh, speed. Sheer speed. It's correct. The correctness is there. The correctness for both forwards and backwards passes is there, but on Nvidia it's about 5x slower than PyTorch right now. Like, 5x, wow, this is insurmountable. No, there are reasons it's 5x slower, and I can go through how we're going to make it faster. And it used to be, you know, 100x slower, so we're making progress. But there's one place where it actually is competitive, and that's Qualcomm GPUs. So tinygrad is used to run the model in openpilot. It's been live in production now for six months, and tinygrad is about 2x faster on the GPU than Qualcomm's library.

And why specifically Qualcomm?

Well, because we use Qualcomm in the comma devices.

Oh, I mean, what is it about Qualcomm's architecture?

Oh, what makes it doable? Well, because the world has spent how many millions of man-hours to make Nvidia fast? And Qualcomm has a team of 10 Qualcomm engineers. Okay, well, who can I beat here? What I propose with tinygrad is that developer efficiency is much higher. But even if I have 10x higher developer efficiency, I still lose on Nvidia, right? You know, okay, I didn't put a hundred thousand man-hours into it, right? If they put in a million... that's what I'm saying. But we can get, and we are going to close
this speed gap a lot. I don't support tensor cores yet. That's a big one that's just going to massively close the gap. And then AMD, I don't even have a benchmark for AMD because I couldn't get it compiled. Oh, and I tried. I spent actually a day trying to get PyTorch built. I got it kind of working, then I tried to run a model, and there's all kinds of weird errors, and the rabbit holes are so deep on this. So, you know, you can compare the speed. Right now you can run LLaMA, you can run anything you want on AMD. It already all works. Any OpenCL backend works, and it's not terribly slow. I mean, it's a lot faster than crashing, so it's infinitely times faster than PyTorch on AMD. But pretty soon we're going to start getting close to theoretical maximums on AMD. That's really where I'm pushing, and I want to get AMD on MLPerf in a couple of months, hopefully.

Now that you bring up AMD, let's dive into that, because when you announced the tinycorp fundraise, you mentioned one of your first goals is to build the framework, runtime, and driver for AMD. And then on June 3rd on Twitch, you weren't as excited about AMD anymore. Maybe let's talk a bit about that. You compared the quality of commit messages from the AMD kernel to the Intel work that people are doing there. What's important to know?

So when I said I want to write a framework, I never intended on writing a kernel driver. I mean, I flirted with that idea briefly, but realistically, there are three parts to it, right? There's the ML framework, there's the driver, and then there's the user space runtime. I was even down to rewrite the user space runtime. I have a GitHub repo called cuda_ioctl_sniffer. It's terribly called, but you can actually launch a CUDA kernel without CUDA. You don't need CUDA installed, just the Nvidia open source driver, and this
open source repo can launch a CUDA kernel. So rewriting the user space runtime is doable. Rewriting the kernel driver, I don't even have docs. I don't have any docs for the GPU. It would just be a massive reverse engineering project. So that is when I saw that... like, I wasn't complaining about it being slow, I wasn't complaining about PyTorch not compiling. I was complaining about the thing crashing my entire computer. It panics my kernel, and I have to wait five minutes while it reboots, because it's a server motherboard and they take five minutes to reboot. So I was like, look, if you guys do not care enough to get me a decent kernel driver, there's no way I'm wasting my time on this, especially when I can use Intel GPUs. Intel GPUs have a stable kernel driver, and they have all their hardware documented. You can go and find all the register docs on Intel GPUs. So I'm like, why don't I just use these? Now, there's a downside to them. Their GPU is $350, and you're like, what a deal, it's $350. You know, you get about $350 worth of performance, and if you're paying about $400 for the PCIe slot to put it in, between the power and all the other stuff, you're like, okay, never mind, you've got to use Nvidia or AMD, from that perspective. But I sent an email to Lisa Su. She responded.

Oh, nice. You published that email in the Discord?

I did, I did. And she responded, and I've had a few calls since. And what I tried to do, well, first off, thank you for responding. It shows me that... like, if you don't care about your kernel panicking, this is just a huge waste of my time, right? I'll find someone who will care. I'm not asking for your seven-by-seven Winograd convolution when transposed to be fast. I'm not asking for that. I'm asking literally for the basics. And this isn't tinygrad, this is your demo apps. I ran their demo apps in loops and I got kernel panics. But no, Lisa reached
out, connected me with a whole bunch of different people. They sent me a pre-release version of ROCm 5.6. They told me, you can't release that. I'm like, I don't care. But they say they're going to release it by the end of the month, and it fixed the kernel panic. The guy managed to reproduce it with the two GPUs in the computer, and yeah, sent me a driver, and it works. So yeah, I had that experience, and then I had another experience where I had two calls with AMD's communication people, and just explained to these people open source culture. It's not open source if you dump the source code on a GitHub repo and then forget about it until the next release. It's not open source if all your issues are from 2022. No one's going to contribute to that project, right? Sure, it's open source in a very technical sense.

To be fair, it's better than nothing.

It's better than nothing. But I fixed a bug in NCCL. There's a fun fact, by the way: if you have a consumer Nvidia GPU, they don't support peer-to-peer, and their all-reduce bandwidth is horrendously slow, because it's using CUDA kernels to do the copy between the GPUs, and it's putting so many transactions on the PCIe bus that it's really slow. But you can use cudaMemcpy, and there's a flag to use cudaMemcpy, but that flag had a bug. So I posted the issue on NCCL. I expected nothing to happen. The Nvidia guy replied to me within an hour. He's like, try this other flag. I'm like, okay, I tried the other flag, it still doesn't work, but here's a clean repro. And I spent like three hours writing a very clean repro. I ended up tracking the issue down myself, but just the fact that somebody responded to me within an hour and cared about fixing the issue, okay, you've shown that it's worth my time, and I will put my time in, because let's make this better. I'm here to help. But if you show me that, you know, you're
like, yeah, the kernel panics, that's just expected, okay.

Well, it sounds like AMD is getting the message.

They are. And I just don't really think they've had someone explain it to them. I was like, you could build in public. And they're like, what's an example of building in public? I'm like, go look at PyTorch. Go look at PyTorch, right? I have two minor things merged into PyTorch because it's very responsive. It's only minor bug fixes, but I feel like it's, you know, yeah.

So that's kind of the lowest level of the stack, and then at a slightly higher level, obviously there's tinygrad, there's Mojo, there's ggml. How are you thinking about breadth versus depth, and where you decided to focus early on?

So ggml is very much like, okay, everyone has M1s, right? Actually, that's what I was thinking in the beginning. I was thinking of something more like ggml, focused on the M1s, but ggml showed up and was just like, we're actually just focusing on the M1s. And actually, M1 PyTorch is considerably better than AMD PyTorch. M1 PyTorch works: it only gives wrong answers sometimes, it only crashes sometimes, but some models kind of run. When I was writing the Metal backend, I was comparing to MPS PyTorch, and I had a discrepancy. tinygrad checks all its outputs compared to torch, and there was one where it didn't match. I'm like, I checked the matrix by hand, it matches tinygrad, I don't understand. And then I switched PyTorch back to CPU and it matched. I'm like, oh. Yeah, well, there are bugs, like if you transpose the matrix, because I think it has to do with multi-views in PyTorch and weird under-the-hood stuff that's not exposed to you. There are bugs, and maybe they fix them, but, you know, it seems like there was a lot of momentum, again, because you're getting a huge variety. How many engineers care about making PyTorch work on M1, right?
Thousands, tens of thousands. Yeah, and you have an open development process, and guess what, it's going to be good. How many engineers care about AMD working with PyTorch? Oh, you've got 10 guys that work for AMD and then, like, a couple of hobbyists. You revealed an interesting detail about how you debug, which is you hand-check the matrix math. No, I don't hand-check it. One of the best tests in tinygrad is a file called test_ops.py, and it's just a hundred small examples written in tinygrad and PyTorch, and it checks both the forwards and backwards to make sure they match. The test suite, yeah, very important. I mean, that's one of the things where I put a lot of effort into the CI for tinygrad. I think CI is super important. I want that green check to mean I can merge this. Yeah. I don't want my tests to... and if you somehow manage to introduce a bug and get the green check, okay, we're fixing the test, top priority. Yeah. Mojo? It's closed source. No, I'm not that interested, you know what I mean? Look, I like Chris Lattner. I think he's going to do great things, and I understand kind of the wisdom, even, in keeping it closed source, but I'm interested when it's open. Yeah, right. You have an interesting design deviation from him, because he's decided to be, well, promised to be, a superset of Python, and you have decided to break with PyTorch APIs, and I think that affects learnability and transportability of code. You know, if the torch thing ends up being a stumbling block, I could write a perfect PyTorch, like a, you know, instead of import torch, you type import tinytorch as torch. And if that really becomes the stumbling block, okay, I will do that. No, Chris Lattner went much further than PyTorch. Replicating the PyTorch API is something I can do with a couple, you know, like an
engineer-months, right? Like a shim. Yeah. Replicating Python? There's a big graveyard of those projects. How's Pyston going? How's, oh, Jython? PyPy? As a whole, you can go way back. So tinygrad is one layer. You announced tinybox recently, which is, you know, you made it so your core mission is commoditizing the petaflop, and then your business goal is to sell computers for more than they cost to make, which seems super reasonable. And you're going to have three tinyboxes, red... No, no, no, no, that was my, look, you know, a lot of people, like, I love, you know, leaning into saying I'm giving up, right? It's great to give up. Giving up is this wonderful thing, it's so liberating, and then you can decide afterward if you really give up or not. There's very little harm in saying you give up, except, you know, great, Twitter haters have something to talk about, and all press is good press, kids. So obviously just red, only red. tinybox red. tinybox red, unless AMD, you know, upsets me again, and then we have other colors to choose from. When you think about hardware design, what are some of the numbers you look for? Teraflops per second is one, but memory bandwidth is another big limiter. How do you make those trade-offs? Well, I mean, fundamentally I'm limited to what GPUs I can buy. But yeah, for something that I think a lot of people are going to want to reasonably do with, a co-worker of mine described them as luxury AI computers, right? Luxury AI computers for people, and that's what we're building. And I think a common thing people are going to want to do is run large Llama, right, or large Falcon or whatever. FP16 Llama? FP16, exactly, exactly. You know, int8 I think can work. What ggml is doing, going to int4, I think this doesn't work. Like, have you done... maybe they have, but I read what it was and I was like, this isn't from any paper, this is just some, like, squeezing as much
as possible. Yeah, you made up some quantization standard to make it run fast, and, like, maybe it works, but okay, where's the HellaSwag number, right? Where's your, you know... The thesis is, right, that if you have billions, hundreds of billions of parameters, the individual quantization doesn't actually matter that much. Well, the real way to look at all of that is to just say you want to compress the weights, right? Quantization is a form of weight compression. Now, this is obviously not lossless. It's not a lossless compressor, right? If it's a lossless compressor and you can show that it's correct, then okay, we don't have to have any other conversation. But it's a lossy compressor. Yes. And how do you know that your loss isn't actually losing the power of the model? Maybe int4 65B Llama is actually the same as FP16 7B Llama, right? And we don't know. Maybe someone has done this by now, but I looked for it when it first came out and people were talking about it, and I'm like, it's not from a paper, right? The int8 stuff is from a paper. There's one paper, I think it's LLM.int8(), where they actually, you know, do all the tests. And they didn't go fully int8. They made like 90% of it int8 and kept like 10% of it in FP16 for what they called the outliers or whatever. So I think that this is not quite so easy. And well, first off, if you're training, no one's gotten training to work with int8. Yeah, there's a few papers that vaguely show it. If you're training, you're going to need BF16 or float16.
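A minimal sketch of the claim above that quantization is a lossy form of weight compression. It assumes a plain symmetric per-tensor int8 scheme (this is not ggml's format and not the LLM.int8() method), synthetic Gaussian weights, and hypothetical helper names; Python's `struct` half-precision format stands in for FP16 storage.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int8(weights):
    """Symmetric per-tensor quantization onto the integer range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the int8 codes back to FP16 values."""
    return [to_fp16(v * scale) for v in q]


random.seed(0)
# Synthetic "weights": an assumption for illustration, not real model data.
weights = [to_fp16(random.gauss(0.0, 0.02)) for _ in range(10_000)]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
changed = sum(a != b for a, b in zip(weights, restored))
print(f"scale={scale:.6g}  max_abs_err={max_err:.6g}  changed={changed}/{len(weights)}")
```

The per-weight error is bounded by roughly half the scale, but it is not zero, which is the speaker's point: whether that loss matters has to be measured on a benchmark such as HellaSwag, not assumed.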
So this is why I target that. Now, the thing that you're going to want to do is run these large language models out of the box on your hardware in FP16, and that's memory bandwidth. So you need large amounts of memory bandwidth too. So as for how I trade off memory bandwidth and flops: well, what GPUs can I buy? And I saw one of your... so first of all, you have this hiring process, which is you've got to solve one of the bounties that are open on tinygrad. There's no technical interview. One of them is int8 support. Do you already have some things you want to test on? We have int8 support. What I'd like to see somebody do is just load the ggml int8 Llama into tinygrad and then benchmark it against the FP16 one. Int8 already works in tinygrad. It doesn't actually do the math in int8, which is even a stronger, like, it does all the math still in FP32. So int8 can mean you just have your weights in int8, or int8 can mean you actually do your math in int8. And rather than doing your math in int8, the big gain that people care about is actually having your weights in int8, because weights in int8 mean less memory and less memory bandwidth, whereas the math, keep it in FP32.
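The distinction drawn above, int8 weights with floating-point math versus int8 math, can be sketched as follows. This is an illustration under assumptions, not tinygrad's implementation: the per-row scale, the helper names, and the toy matrix are all made up, and Python floats (64-bit) stand in for FP32 arithmetic.

```python
from array import array


def quantize_rows(matrix):
    """Weight-only quantization: int8 values plus one float scale per row."""
    rows, scales = [], []
    for row in matrix:
        scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid a zero scale
        rows.append(array("b", (round(w / scale) for w in row)))
        scales.append(scale)
    return rows, scales


def matvec(rows, scales, x):
    """Dequantize on the fly; all multiply-adds run in floating point."""
    return [s * sum(q * xi for q, xi in zip(row, x)) for row, s in zip(rows, scales)]


def weight_gigabytes(params, bits_per_weight):
    """Storage needed for the weights alone, in GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


W = [[0.5, -0.25, 0.125], [1.0, 2.0, -4.0]]  # toy weights, made up
x = [1.0, 2.0, 3.0]

rows, scales = quantize_rows(W)
approx = matvec(rows, scales, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in W]

print("exact :", exact)
print("approx:", approx)
print("stored weight bytes:", sum(len(r) * r.itemsize for r in rows))

for bits in (16, 8, 4):
    print(f"65B parameters at {bits} bits: {weight_gigabytes(65e9, bits):.1f} GB")
```

Only the stored weights shrink here; the arithmetic is unchanged, which is why the gain shows up as memory footprint and memory bandwidth rather than as faster math.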
On M1s it doesn't even matter. It doesn't matter what data type you're doing in the GPU. I'm not sure it can do int8, but FP16 and FP32 is the same, it's the same teraflops. So yeah, no, that's one of the bounties. One of the bounties is get int8 Llama running with the int8 weights. And then actually, what you could even do, if you really want to test this: just take the FP16 weights, convert them to int8, then convert them back to FP16, then compare the unconverted and converted. Oh, that's a nice hack. Oh yeah. Right, like, it should be lossless in the other direction, right? Yes. Yeah, I think FP16, it should be lossless in the other direction. I'm actually not 100% about that. Why not? Oh, because, like, did you ever try to, like, if you want to represent... if it was, like, int16, it's not lossless. I think all of int8 can be represented in FP16, but I'm not 100% about that. Okay. Actually, I think we just try the bytes. We just have to do it, right? Just literally do it. There's only 256 to check. But yeah, either way, or, I mean, int4, definitely. So do your int4, convert it back, and now see, even with int4 weights and FP32 math, okay, how much does your performance degrade on this model? Yeah. Yeah. So I'm about to zoom out a little bit from the details. I don't know if you had more. No, I think, like, you're planning to release the first tinybox, ship them in like two to six, eight months, something like that. What's top of mind for you in terms of building the team? Who are you calling for? Yeah. Well, to stay on the tinybox for a second. Yeah, exactly. So the GPUs are picked out, and you're like, well, I could make that computer with the GPUs. And my answer is, can you? Do you know how hard it is to put six GPUs in a computer? People think it's really easy, and it's really easy to put one GPU in a computer. It's really easy to put two GPUs in a
computer, but now you want to put in eight. Okay, so I'll tell you a few things about these GPUs. They take up four slots. What kind of computer can you buy? The nicest Supermicro, you can't put eight of those in there. You need two-slot blowers. If you want to use one of those 4U Supermicros, you need two-slot blowers, right, or water cooling. If you're trying to get the four-slot cards in there, you're going to need some form of water cooling, or, there are some, like, Chinese 4090s that are blowers, right? You need blowers or water cooling if you're trying to get it in those things. So you're doing water? No, I'm not using that chassis. Okay. Then the other thing: okay, so now you want to get six GPUs in a computer, so that's a big challenge. You're like, oh, I'll just use PCIe extenders, I saw it on Linus Tech Tips, it works great. No, it doesn't. Try PCIe extenders that work at PCIe 4.0. And interconnect bandwidth is super important. Yes. They'll work at 3.0. No PCIe extender I've tested, and I've bought 20 of them, works at PCIe 4.0. So you're going to need PCIe redrivers. Now, okay, how much is that adding to the cost, right? These things all get really hard. And then with tinybox, I've even added another constraint: I want this thing to be silent. Not totally silent, but my limit is like 45, maybe 50 dB, but not Supermicro machine, 60 dB. We have a compute cluster at comma. You've got to wear ear protection to go in there. Yeah, I've seen some videos where you give a tour. Oh yeah. Yeah, it's super noisy. Super loud, 10,000 RPM, just screaming. I want to be able to use the normal big GPU fans and make this thing so it can sit under your desk, plug into one outlet of power, right? Six GPUs, but your GPUs are 350 watts each. You can't plug that into a wall outlet. Okay, so how are you going to deal with that? Good questions, right? And you're not sharing them? Well, that one, I mean, that one is pretty obvious. You have to limit
the power on the GPUs, right? You have to limit the power on the GPUs. Now, you can limit power on GPUs and still get, you can use like half the power and get 80% of the performance. This is a known fact about GPUs, but that's one of my design constraints. So when you start to add all these design constraints, good luck building a tinybox yourself. Obviously it can be done, but you need something that has actually quite a bit of scale and resources to do it. And you see, like, under the desk as one of the main use cases, kind of like individual developer use? Yeah. What I also see is more of, like, an AI hub for your home, right? As we start to get home robotics kind of stuff, you don't want to put the inference on the robot, but you also don't want to put the inference on the cloud. You don't want to put it on the robot because, okay, it's 1,500 watts, tinybox. You'd put in batteries, you'd have to charge them. Bad idea. Just use wireless. Wireless is 0.5 milliseconds. Yeah. Right, this is super fast. You don't want to go to the cloud for two reasons. One, cloud's far away. Okay, it's not that far away, you can kind of address this. But two, cloud's also mad expensive. Cloud GPUs are way more expensive than running that GPU at your house, at least at any rate you're going to get, right? Maybe if you commit to buy, well, yeah, I'm going to buy 10,000 GPUs in three years, then maybe the cloud will give you a good rate. But you want to buy one GPU in the cloud? I mean, okay, you can go to, like, Vast, but if you're going on Azure or AWS, that's expensive. Yeah. This is like a personal data center, you know, instead of a cloud data center. We like the term compute cluster, so we can use Nvidia GPUs. Data center may be a little bit dated. It's a compute cluster, which is totally legal under the CUDA license agreement. You talk a lot about the PCIe connection. Do you think there's any fat there to trim? What do you mean? Just, you're limited by bandwidth, right? Okay, for
some things, yes. So the bandwidth is roughly 10x less than what you can get with NVLinked A100s. Yeah. Right, NVLinked A100s are going to have, and then you can even get, like, full fabric, and Nvidia really pushes on that stuff, 600 gigabytes per second, right? And PCIe 4, you're going to get 60. So you're getting 10x less. Yeah. That said, why do you need the bandwidth, right? And the answer is you need it for training huge models. If you're training on a tinybox, your limit's going to be about 7 billion, right? If you're training on big stuff, your limit could be like 70 billion. Okay, you can hack it to get a bit higher. You can hack it, like GPT hacked it to get a bit higher, but that 65 billion in Llama, there's a reason they chose 65 billion, right? And that's what can reasonably fit model-parallel on eight GPUs, right? So yes, you are going to end up training models, and the cap's going to be like 7 billion. But I actually heard this on your podcast: I don't think that the best chatbot models are going to be the big ones. I think the best chatbot models are going to be the ones where you had a thousand training runs instead of one, and I don't think that the interconnect bandwidth is going to matter that much. So what are we optimizing for instead of compute optimal? What do you mean, compute optimal? So you're talking about this, the Llama-style models where you train for, like, 200... you train longer. Yeah, yeah, yeah. So, okay, you can always make your model better by doing one of two things, right? And at comma, we just have a strict limit on it. You can always make your model better by training longer, and you can always make your model better by making it bigger. But these aren't the interesting ones, right? Particularly the making it bigger, because training it longer, fine, you're getting a better set of weights. The inference is the same. The inference is the same whether I trained it for a day or a week. Yeah. But the,
okay, if it's one billion versus 10 billion, well, I 10x'd my inference too, right? So I think that these big models are kind of, sure, they're great if you're research labs and you're trying to, like, max out this hypothetical thing. Which you can talk about later. Yeah, yeah, yeah. But if you're, like, a startup, or you're an individual, or you're trying to deploy this to the edge anywhere, you don't need that many weights. Yeah, yeah. You don't want them anyway. Optimizing for inference rather than capabilities, doing benchmarks. Yes, yes. And I think the inference thing, right, there's going to be so much more. Right now, the ratio between training and inference on clouds, I think it's only still, like, two or three x, right? There's two or three x more inference, which doesn't make any sense, right? There should be way more inference. Yeah. There should be 10 to 100x more inference in the world than training. But then also, like, what is training, right? You start to see these things like LoRA. It's kind of blurring the lines between inference and training, and I think that that blurred line is actually really good. I'd like to see much more on-device training, or on-device fine-tuning of the final layer. Yeah. We're pushing toward this stuff at comma, right? Like, why am I shipping a fixed model? I totally want this model to fine-tune based on, like, how, you know, your left tire is flat, right? Every time you cut the same turn because your left tire is flat, well, it should learn that, right? So would comma pursue parameter-efficient fine-tuning? Yeah, yeah, we're looking at stuff like that. I mean, comma is already very parameter-efficient, because we have to, like, run this thing in a car, and you have to cool it and power it. Yeah, yeah, yeah. And so that's kind of like the intelligence cluster you have in your home. You see, when the person is using third-party models, they load them locally and
kind of do the final fine-tuning, and it kind of stays within the box. Yeah, I think that's one version of it, for the privacy-conscious. I also see a world where you can have your tinybox, in its down cycles, mine FLOPcoin, right? It turns out not all crypto is a scam. There's one way to tell if crypto is a scam: if they're selling the coin before they make the product, it's a scam. If they have the product, it's maybe not a scam, right? So yeah, my thought is, each tinybox would have a private key on it. And you have to do it this way. You can't just let anyone join, because of Sybil attacks, right? There's a real problem of, like, how do I ensure your data is correct? And the way that I ensure your data is correct on the tinynet is, if you ever send wrong data, your $15,000 hardware box is banned. So, you know, don't cheat. Obviously, if it messes up, we'll forgive you. But I'm saying, like, somebody's going to try to jailbreak your devices. There's no jailbreak. No jailbreak? There's just a different network. There's just a private key on e
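
Earlier in the conversation, the question of whether every int8 value can be represented in FP16 is left open with "there's only 256 to check". A minimal sketch of that exhaustive check follows, using the half-precision format in Python's `struct` module as a stand-in for FP16; the helper name is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Try all 256 int8 values: convert to FP16 and see whether anything changes.
mismatches = [v for v in range(-128, 128) if to_fp16(float(v)) != float(v)]
print("int8 values that change in FP16:", mismatches)

# For contrast, find the first positive integer that FP16 cannot hold exactly.
first_inexact = next(n for n in range(1, 1 << 16) if to_fp16(float(n)) != float(n))
print("first positive integer FP16 cannot hold exactly:", first_inexact)
```

FP16 has an 11-bit significand, so every integer with magnitude up to 2048 is exact. That covers all of int8 but not all of int16, which matches the intuition voiced in the conversation.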

Original Description

How tinygrad is taking on Nvidia, Google, and PyTorch with a tiny team, building in public with AMD, hot takes on ggml, Mojo, and GPT-4, and why AI Girlfriend is next.

Writeup and show notes: https://www.latent.space/p/geohot
Hosts' Twitter: @swyx and @fanahova

Timestamps:
00:00:00 - Introducing George
00:02:59 - Tinycorp's 3 Theses
00:11:12 - Tinygrad's creation
00:15:58 - Operation fusing in Tinygrad
00:19:11 - Tinygrad debugging
00:21:14 - Tiny Competitiveness on QCOMM vs NVDA
00:23:21 - geohot vs AMD
00:28:21 - Tinygrad vs ggml
00:30:01 - Importance of Good CI
00:30:37 - Mojo and Compatibility
00:32:43 - ggml quantization is made up
00:35:18 - tinygrad: benchmark int8 vs fp16
00:37:39 - Why you can't build tinybox
00:40:28 - The personal compute cluster
00:43:08 - Compute Optimal to Inference optimal
00:45:06 - Announcing FLOPcoin
00:46:23 - Why Federated AI won't work
00:47:38 - 5x faster than Nvidia
00:48:53 - A Person of Compute
00:49:49 - GPT-4's real architecture
00:51:07 - BatchNorm, FlashAttention
00:52:34 - The Bitter Lesson
00:55:31 - Hiring in the Age of AI
01:00:02 - Why AI doesn't replace developers & artists
01:03:02 - Comma Body
01:07:34 - AI Girlfriend
01:11:00 - The Goddess of Everything Else
01:13:43 - John Carmack Insights
01:17:41 - on Elon
01:18:47 - on e/acc
01:20:24 - Avatar 2

George Hotz discusses tinygrad and its potential to make ML compute accessible to everyone, with a focus on using AMD GPUs and developing a personal data center called Tiny Box. The discussion covers various topics, including LLM engineering, fine-tuning, and prompting.

Key Takeaways

Build a restricted instruction system for ML compute using tinygrad
Optimize neural networks for performance using tinygrad
Develop a personal data center for ML compute using AMD GPUs
Fine-tune large language models using quantization and weight compression
Craft effective prompts for LLMs to improve performance

💡 Making ML compute accessible to everyone is crucial for the development of AI, and tinygrad has the potential to disrupt the current dominance of Nvidia, Google, and PyTorch in the ML compute market.
