FlashAttention-2: Making Transformers 800% faster AND exact

Latent Space · Advanced · 🧠 Large Language Models · 2y ago


You can read the full write-up on FlashAttention's architecture and inner workings on our blog: https://www.latent.space/p/flashattention

00:00:00 - Tri's background
00:02:18 - FlashAttention’s deep dive
00:17:21 - How the Hazy Research group collaborates across theory, systems, and applications
00:25:00 - Evaluating models beyond raw performance
00:27:00 - FlashAttention-2
00:30:00 - CUDA and The Hardware Lottery
00:35:00 - Researching in a fast-changing market
00:37:30 - Promising transformer alternatives like state space models and RNNs
00:43:00 - The spectrum of openness in AI models
00:47:12 - Practical impact of models like LLAMA2 despite restrictions
00:49:43 - Incentives for releasing open training datasets
00:53:22 - Lightning Round

What You'll Learn

The video discusses FlashAttention-2, an exact attention algorithm that the title bills as making Transformers 800% faster, and its applications in efficient Transformer training and inference and in long-range sequence models. It also covers how the Hazy Research group collaborates across theory, systems, and applications, the complementary roles of academia and industry, and the importance of evaluation and measurement for emerging use cases like chatbots.
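
As a rough illustration of what "exact" means for the attention technique described above, the sketch below computes attention for a single query in small key/value blocks, keeping only a running maximum, a running sum, and a running weighted total, then compares the result with the all-at-once computation. This is the online softmax rescaling idea discussed in the interview, written in plain Python for readability. It is not the FlashAttention kernel, and the function names, block size, and toy data are assumptions made up for this example.

```python
import math


def attention_row_naive(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_row_tiled(q, keys, values, block_size=2):
    """Same quantity, but visiting keys/values one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy data: one 2-dimensional query, five keys, five values.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

naive = attention_row_naive(q, keys, values)
tiled = attention_row_tiled(q, keys, values, block_size=2)
max_diff = max(abs(a - b) for a, b in zip(naive, tiled))
print("largest difference between naive and tiled:", max_diff)
```

The two results should agree up to floating-point rounding, whatever block size is used. The tiled version never holds more than one block of scores at a time, which is the property that lets a real kernel keep its working set in fast on-chip memory.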

Full Transcript

Today we have no swyx, because he's in Singapore, so it's a one-on-one discussion with Tri Dao. Welcome.

Hi everyone, I'm Tri. I'm excited to be here.

So Tri just completed his PhD at Stanford a month ago. You might not remember his name, but he's one of the main authors of the FlashAttention paper, which is one of the seminal works of the Transformer era. He's got a lot of interests, from efficient Transformer training and inference to long-range sequence models, a lot of interesting stuff. And now you're going to be an assistant professor in CS at Princeton next year.

Yeah, that's right.

Nice. And in the meantime, just to get, you know, a low-pressure thing, you're chief scientist at Together as well, which is the company behind RedPajama.

Yeah, so I just joined this week, actually, and it's been really exciting.

Nice. So what's something that is not on the internet that people should know about you?

Let's see. Before I started college, I thought I was going to be an economist. I was fully on board; I was going to major in economics. But the first week I was at Stanford as an undergrad, I took a few math classes and immediately decided that I was going to be a math major, and that kind of changed the course of my career. So now I'm doing math, computer science, and AI research.

Nice. I had a similar thing. I started with physics, then I took a programming course and I was like, I've got to do computer science, I don't want to do physics. So FlashAttention is definitely, you know, everybody's using it, everybody loves it. You just released FlashAttention-2 last week.

Yeah, early this week, on Monday.

Yeah, let's move fast. Four days ago is one week ago in AI. So maybe let's run through some of the FlashAttention highlights and some of the innovation there, and then we can dive into FlashAttention-2. So the core improvement in FlashAttention is that
traditional attention is quadratic in sequence length, and FlashAttention is linear, which obviously helps with scaling some of these models.

Right, so there are two factors there. Of course, the goal has been to make attention go faster or be more memory efficient. Ever since attention became popular in 2017 with the Transformer paper, lots and lots of folks have been working on this, and a lot of approaches have focused on approximating attention. The goal is that you want to scale to longer sequences. There are tons of applications where you want to do that, but scaling to longer sequences is difficult because attention scales quadratically in sequence length in both runtime and memory, as you mentioned. So instead of trying to approximate attention, we were trying to figure out: can we do the same computation and maybe be more memory efficient? In the end, the memory is linear in sequence length. In terms of computation it's still quadratic, but we managed to make it much more hardware friendly, and as a result we do get wall-clock speedup on the order of 2 to 4x. That really helps, because it means you're able to train with 2 to 4x longer sequence lengths for the same cost, without doing any approximations. As a result, lots of folks have been using this. I think it's available in a lot of libraries that do language model training or fine-tuning.

Yeah, and the approximation thing is important, because this is an exact thing versus, like, the sparse methods. So maybe explain a little bit the difference there.

For sure. So, attention: essentially you compute pairwise similarity between every single element in the sequence against each other. There have been other approaches where, instead of doing all that pairwise computation, you only compute similarity for some pairs of elements in the sequence, so you don't do a quadratic number of
comparisons. This can be seen as some form of sparsity: essentially you're ignoring some of the elements. When you write down the matrix, you essentially say, okay, I'm going to pretend they're zero. That has some benefits in terms of runtime and memory, but the trade-off is that it tends to do worse in terms of quality, because you're essentially approximating or ignoring some elements. I personally worked on this as well for a few years. But when we talked to practitioners who actually train models, especially at large scale, they said, well, we tend not to use these approximate attention methods. It turns out, and this was surprising to me at the time, that these approximation methods, even though they perform fewer computations, tend not to be faster in wall-clock time. This was pretty surprising because back then my background was more on the theoretical side, so I was thinking of how many FLOPs, or floating-point operations, you are performing, and hopefully that correlates well with wall-clock time. But I realized that I was missing a bunch of ideas from the systems side, where FLOPs don't necessarily correlate with runtime. There are other factors like memory reading and writing, parallelism, and so on. So I learned a ton from just talking to systems people, because they kind of figured this stuff out a while ago. That was really eye-opening. And then we ended up focusing a lot more on memory reading and writing, because it turned out that the majority of the time when you're doing attention is spent reading and writing memory.

Yeah, the I/O awareness is probably one of the biggest innovations here. And the idea behind it is, like you mentioned, the FLOPs growth of the cards has been going up, but the memory bandwidth not as much. So I think maybe that was one of the assumptions
that the original attention paper had. So talk a bit about how that came to be as an idea. It's one of those things where the insight seems obvious: why are we writing to HBM every time? Once you change it, it's clear. But what was that discovery process?

Yeah, so I think in hindsight a lot of the ideas had already been there in the literature. I would say it was somehow at the intersection of machine learning and systems, and you kind of needed ideas from both sides. On one hand, on the systems side, lots of systems folks have known that kernel fusion is great. Kernel fusion just means that instead of loading an element, performing an operation, writing it down, loading it back up, and performing the second operation, you load it once, perform two operations, and then write it down again. That saves you a memory read and write in the middle there. So kernel fusion has been a classic. There have been other techniques from the systems side, like tiling, where you perform the computation in blocks, again so that you can load each block into a really fast memory; think of it as a cache. This is, again, a classical computer science idea, right? You want to use the cache. So the systems folks have been thinking about these ideas for a long time, and they apply to attention as well. But there were certain things in attention that made it difficult to do a complete kernel fusion, one of which is that there is this softmax operation in the middle, which requires you to essentially sum across a row of the attention matrix. That dependency makes it difficult to break things into blocks. So on the systems side, people have been thinking about these ideas, but it's been difficult
to do kernel fusion for the entire operation. On the machine learning side, people have been thinking more algorithmically. They say, okay, either we can approximate attention, or there's this trick called the online softmax trick, which says that because of the way softmax is written mathematically, you can actually break it up into smaller pieces, do some rescaling, and still get the right answer. This online softmax trick has been around for a while. I think there was a paper from NVIDIA folks back in 2018 about this, and then there was a paper from Google: Markus Rabe and Charles Staats wrote a paper in late 2021 on using this online softmax trick to break attention up into smaller pieces. So a lot of the ideas were already there, but it turns out you need to combine ideas from both sides. You need to understand that, hey, we want to do kernel fusion to reduce memory reads and writes, but we also need this online softmax trick to be able to break the softmax into smaller pieces so that a lot of the systems tricks carry through. We saw that, and it was kind of a natural idea that we ended up using ideas from both sides, and it ended up working pretty well.

Are there any downsides to kernel fusion? If I think about databases and the reasons why we have atomic operations, it's like you have observability and fallback in between them. How does that work with attention? Is there anything that we lose by fusing the operations?

Yeah, I think mostly on the practical side: when you do kernel fusion, you lose a little bit of flexibility, in the sense that, for example, FlashAttention is just a subroutine that you would call to do attention. But as a researcher, let's say you don't want that exact thing, right? You don't want just attention; let's say you want some
modification to attention. You want to say, hey, I'm going to multiply the query and key, but then I'm going to do this extra thing before I carry on. Kernel fusion just means that we have a subroutine that does the entire thing, but if you want to experiment with things, you won't be able to use that fused kernel. And of course the answer is: can we have a compiler that automatically does a lot of this kernel fusion? Lots of compiler folks are thinking about this, either with a new language or embedded in PyTorch. The PyTorch folks have been working on this as well: if you just write your code in PyTorch and they can capture the graph, can they generate code that will fuse everything together? That's still ongoing, and it works for some cases, but for attention, because of this softmax rewriting stuff, it's been a little bit more difficult. So maybe in a year or two we'll have compilers that are able to do a lot of these optimizations for you, and you won't have to, for example, spend a couple of months writing CUDA to get this stuff to work.

Awesome. And just to make it clear for listeners: when we say we're not writing it to memory, we are still storing it, just in a faster memory. So instead of the HBM, we're putting it in the SRAM.

Yeah, that's right.

Maybe explain just a little bit the difference there.

Yeah, for sure. This is kind of a caricature of how you think about accelerators, or GPUs in particular: they have a large pool of memory, usually called HBM, high-bandwidth memory. This is what you think of as GPU memory. So if you're using an A100 and you list the GPU memory, it's like 40 gigs or 80 gigs; that's the HBM. Then when you perform any operation, you need to move data from the HBM to the compute unit, the actual hardware unit that does the
computation. Next to these compute units there is what's called on-chip memory, or SRAM, which is much, much smaller than HBM but much faster. The analogy there is, if you're familiar with, say, CPU and RAM: you have a large pool of RAM, and then you have the CPU performing the computation, but next to the CPU you have L1 cache and L2 cache, which are much smaller than DRAM but much faster. So you can think of SRAM as a small and fast cache that stays close to the compute unit; physically, it's closer. And so there is some kind of asymmetry here: HBM is much larger, and SRAM is much smaller but much faster. One way of thinking about it is: how can we design algorithms that take advantage of this asymmetric memory hierarchy? And of course, lots of folks have been thinking about this. These ideas are pretty old. I think back in the 1980s the primary concern was sorting: how can we sort numbers as efficiently as possible? The motivating example was banks trying to sort their transactions, and that needs to happen overnight so that the next day they can be ready. The same idea applied: they had slow memory, which was a hard disk, and they had fast memory, which was DRAM, and people had to design sorting algorithms that take advantage of this asymmetry. It turns out these same ideas can apply today, just with different kinds of memory.

Yeah, and in your paper you have kind of the pyramid of memory. Just to give people an idea of what he means by smaller: HBM is like 40 gigs, and SRAM is like 20 megabytes. So it's not a little smaller, it's much smaller. But the throughput on the card is like 1.5 terabytes a second for HBM and like 19 terabytes a second for SRAM, which is a lot larger. How do you think that evolves? So TSMC said
they hit the scaling limits for SRAM; it just cannot grow that much more. HBM keeps growing: HBM3 is going to be 2x faster than HBM2, and I think the latest NVIDIA thing has HBM3. How do you think about the future of FlashAttention? Do you think HBM is going to get fast enough that maybe it's not as useful to use the SRAM?

Yeah, I think that's right. I think it comes down to physics. When you design hardware, SRAM literally stays very close to the compute unit, and so you don't have that much area to put the transistors for the SRAM, and you can't shrink these things too much. So it's just physics: in terms of area, you don't have that much area for the SRAM. The HBM technically is off-chip, so there is some kind of bus that transfers data from HBM to the compute unit, and you have more area to put these memory units. So I think in the future SRAM probably won't get that much larger, because you don't have that much area, while HBM will get larger and faster. And so I think it becomes more important to design algorithms that take advantage of this memory asymmetry. It's the same thing in CPUs, where the cache is really small and the DRAM is growing larger and larger. DRAM could get to, I don't know, two terabytes, six terabytes or something, whereas the cache stays at, I don't know, 50 megabytes or something like that. So I think the algorithm design becomes more and more important in order to take advantage of this. In the future, FlashAttention, you know, right now is being used. I don't know if in the next couple of years some new architecture will come in and whatnot, but attention seems to be still important for the next couple of years. I still expect some of these ideas to be useful; not necessarily, you know,
the exact code that's out there, but I think these ideas have stood the test of time. Ideas like I/O awareness from back in the 1980s, ideas like kernel fusion and tiling: these are classical ideas that have stood the test of time. And so I think in the future these ideas will become more and more important as we scale models to be larger, and as we have more kinds of devices where performance and efficiency become much more important.

Yeah, and we had Jonathan Frankle on the podcast, and if you go to isattentionallyouneed.com, he has an outstanding bet, and he does believe that attention will still be the state-of-the-art architecture in a few years. Did you think FlashAttention would be this popular? I'm always curious on the research side: you publish a paper, and obviously it's great work, but sometimes it just kind of falls flat in the industry. Could you see everybody just starting to use this, or was that a surprise to you?

Yeah, certainly I didn't anticipate the level of popularity. Of course, we're extremely happy to have people using this stuff and giving us feedback and so on, and helping us improve things. When we were writing the paper, I remember sending an email to one of my advisors saying, hey, I'm excited about this paper, but I think the most important thing will be the artifact, which is the code. So I knew that the code would be valuable, and so we focused a lot on the code and made sure that the code is usable and as fast as can be. Of course, the paper presents the ideas, explains them, and has experiments that validate the ideas, but I knew that the artifact, the code, was also pretty important. And that turned out to be the right focus: we put out the paper, we released the code, and continued working
on the code. So yeah, it's a team effort with my co-authors as well.

Yeah, we mentioned Hazy Research a bunch of times on the podcast before. I would love for you to spend five minutes just talking about how the group works. How do people get together? How do you bounce ideas off of each other?

Yeah, so Hazy Research is a research group at Stanford led by one of my advisors, Chris Ré. I love the people there. It was one of the best experiences I had; they made my PhD so much more enjoyable. I think there are a couple of ways that the group has been working pretty well. One is that there's a diverse pool of people: some of them focus on algorithms and theory, some of them focus on building systems, some of them focus on applications, and as a result there is this flow of ideas. As an example, some of us were working on more algorithms and theory, and then we can talk to the folks building systems and say, hey, let's try it out, let's put it in the systems and see how it does. There you get feedback from the systems folks. They will say, okay, we implemented this, or we tried this, and this is where it doesn't work, something like that. And once we put it in the systems, the application folks can use the algorithm or new methods or new models, and we again get great feedback from them. The application folks, for example some of my good friends, focus on medical imaging or seizure detection, and that is the problem they care about. If your method doesn't work on the tasks they care about, they will tell you. Whereas I think a lot of people in machine learning are a little bit more flexible, so they will be like, hey, it doesn't work on seizure detection, let's try some other tasks, right? But having that direct feedback of, like, hey,
it doesn't work there, let's figure out why: I think that feedback allows us to do better work. That process of exchanging ideas and validating them in a real system, so that application folks can try them out and give you feedback, that cycle has been very, very useful. So that's one: having a diverse group of people. The other one, and this is something I really appreciate, is advice from Chris: try to understand the fundamentals. He was happy letting me go off and read some textbooks and play with things, because I think a lot of research ideas come from understanding the old literature and seeing how it fits with the new landscape. If you just read new arXiv papers every day, that's great, but you also need to read textbooks. That's one piece of advice I got from Chris: understand the fundamentals. And I think that allows us to do more impactful work.

Yeah. How do you think about academia versus industry? I feel like AI and machine learning has been an area where, up until three or four years ago, most of the cutting-edge work was being done in academia, and now there are all these big industry research labs. You're obviously going to Princeton, so you're an academia believer. How should people think about where to go? Say I'm doing my master's and I have to decide between doing a PhD and going to OpenAI or Anthropic. How should I decide?

Yeah, so I think they play complementary roles, in my opinion. Of course, I also was considering different paths as well. Right now scaling matters a lot, especially when you talk about language models and generative AI and so on. Scaling matters a lot, and that means that you need compute resources, you need infrastructure, and you need engineers' time. And so industry tends
to have an advantage when it comes to scaling things. But a lot of the ideas actually came from academia. Let's take attention, which got popular with the Transformer in 2017. Attention has actually been around for a while. I think the first mention was in 2014, in a paper from Bahdanau and others in Yoshua Bengio's lab, which is coming from academia. So a lot of ideas did come from academia. And scaling things up, of course: I think OpenAI has been great at scaling things up. That was the bet that they made after, I think, GPT-2. They saw that scaling these things up, which back then was 1.5 billion parameters, seemed to give you amazing capabilities, so they really committed to scaling things, and that has turned out to be a pretty successful bet. For academia, we're still trying to figure out exactly what we're doing in this shifting landscape. Lots of folks have been focusing on, for example, evaluation. I know the Stanford Center for Research on Foundation Models, led by Percy, has this benchmark called HELM, which is this holistic benchmark, trying to characterize the landscape of different kinds of models: what people should evaluate, what people should measure, and things like that. So evaluation is one role. The other one is understanding. This has happened historically, where there's been some development in industry, and academia can play a role in explaining and understanding. They have the luxury to slow down and try to understand stuff, right? So there are lots of papers on understanding what's really going on, probing these models, and so on. I'm not as familiar with the NLP literature, but my impression is there's a lot of that going on at the NLP conferences: understanding what these
models are doing, what capabilities they have, and so on. The third one I could see is that academia can take riskier bets, in the sense that we can work on stuff that is quite different from industry. In industry, my impression is that you have some objective. You're trying to say, hey, for this quarter we want to scale the model in this particular way; next quarter we want the model to have these capabilities. You're trying to hit objectives where there's maybe, I don't know, a 70% chance that it will work out, because it's important for the company's direction. For academia, the way things work is you have many, many researchers or PhD students, and they're pursuing independent directions, and they have a little bit more flexibility: hey, I'm going to try out this seemingly crazy idea and see; let's say there's a 30% chance of success or something. And however you define success, for academia a lot of the time success just means, hey, we found something interesting, and then that could eventually go into industry through collaboration and so on. So I do see academia and industry playing complementary roles. As for someone choosing a career, I think, just more generally, industry would probably be better in terms of compensation, and probably in terms of work-life balance. But my biased perspective is that maybe academia gives you a little bit more freedom to think and understand things. So it probably comes down to personal choice. I ended up choosing to be a professor next year at Princeton, but of course I want to maintain a relationship with industry folks. I think industry folks can provide very valuable feedback on what we're doing in academia, so that we understand where the field is
moving, because some of the directions are very much influenced by what, for example, OpenAI or Google is doing. So we want to understand where the field is moving, what some promising applications are, and try to anticipate: okay, if the field is moving like this, and these applications are going to be popular, what problems will be important in two or three years? Then we try to start thinking about those problems, so that hopefully in two or three years we have some of the answers. Sometimes it works out, sometimes it doesn't, but as long as we do interesting things in academia, that's the goal.

Yeah, and you mentioned the eval side. We did a Benchmarks 101 episode, and one of the things we were seeing is that sometimes the benchmarks really influence the model development, because obviously if you don't score well on the benchmarks, you're not going to get published and you're not going to get funded. How do you think about that? How do you think that's going to change now that a lot of the applications of these models are in more narrow industry use cases? Do you think the goal of the academic eval system is to be very broad, and then industry can do their own evals? What's the relationship there?

Yeah, so I think evaluation is important and often a little bit underrated. It's not as flashy as, oh, we have a new model that can do such and such. But with evaluation, what you don't measure, you can't make progress on, essentially. Industry folks, of course, have specific use cases that their models need to do well on, and that's what they care about. I think not just academia but other groups as well do understand what some of the emerging use cases are. So, for example, you
know, now one of the most popular use cases is chatbots, right? And folks from this organization called LMSYS, some of them are from Berkeley, set up this Chatbot Arena to essentially benchmark different models. So people do understand what some emerging use cases are, and people do contribute to evaluation and measurement. As a whole, I think people try to contribute to the field and move the field forward, albeit maybe in slightly different directions. But we're making progress, and evaluation and measurement is definitely one of the ways you make progress. So I think going forward there are still going to be more models and more evaluation, and we'll just have a better understanding of what these models are doing and what capabilities they have.

Yeah, and I like that your work has been focused not on making benchmarks better, but on, like, let's just make everything faster, so it's very horizontal. So, FlashAttention-2: you just released that on Monday. I read in the blog post that a lot of the work was also related to some of the NVIDIA library updates. Maybe run us through some of those changes and some of the innovations there.

Yeah, for sure. FlashAttention-2 is something I've been working on for the past couple of months. The story is that the NVIDIA CUTLASS team released a new version of their library, which contains all these primitives that allow you to do things like matrix multiply or memory loading on the GPU efficiently. It's a great library, and I built on that. They released their version 3 back in January, and I got really excited and wanted to play with that library. So as an excuse, I was just like, okay, I'm going to refactor my code and use this library. So that was kind of the
start of the project. By the end, I just ended up working with the code a whole lot more, and I realized that, hey, there are these inefficiencies still in FlashAttention. We can change it this way or that way and make it, in the end, twice as fast, but of course building on the library that the NVIDIA folks released. So that was a really fun exercise. I started out with what was in fact just an excuse for myself to play with the new library. What it ended up being was several months of improving FlashAttention and discovering new ideas, and in the end we managed to make it 2x faster. Now it's pretty close to the efficiency of things like matrix multiply, which is probably the most optimized subroutine on the planet. So we're really happy about it. The NVIDIA CUTLASS team has been very supportive, and hopefully in the future we're going to collaborate more.

Yeah. And since it's an NVIDIA library, can you only run this on the CUDA runtime? Or could you use this and then run on, like, an AMD GPU?

Yeah, so it's an NVIDIA library, so right now the code we released runs on NVIDIA GPUs, which is what most people are using to train models. Of course, there is other emerging hardware as well. The AMD folks did implement a version of FlashAttention, I think last year, and that's also available. I think there's some implementation on CPU as well. For example, there's this library ggml, where they implemented the same idea running on Mac and CPU. So I think broadly the idea would apply. The current implementation ended up using NVIDIA's library or primitives, but I expect these ideas to be broadly applicable to different hardware, as long as, and I think this is the main idea, you have asymmetry in the memory hierarchy, which tends to be everywhere in a lot of
accelerators. Yeah, it kind of reminds me of Sara Hooker's post, the Hardware Lottery. Yes. There could be all these things that are much better, like architectures that are better, but they're not better on NVIDIA, so we're never going to know if they're actually improved. How does that play into some of the research that you all do too? Yeah, absolutely. She wrote this piece on the Hardware Lottery, and I think she captured really well what a lot of people have been thinking about this. I certainly think about the hardware lottery quite a bit, given that some of the work I do is really low level, at the level of optimizing for GPUs, or NVIDIA GPUs, and optimizing for attention itself. At the same time I also work on other algorithms and methods and Transformer alternatives, and we do see this effect in play: not just the hardware lottery but also a kind of software framework lottery. Attention has been popular for six years now, and so many engineering hours have been spent on making it as easy and efficient as possible to run Transformers. There are libraries to do all kinds of tensor parallelism if you use a Transformer. Let's say someone else developed alternatives, or let's just take recurrent neural nets like LSTM and GRU: if we want to run those efficiently on current hardware with current software frameworks, that's quite a bit harder. So in some sense there is this feedback loop where the model architectures that take advantage of hardware become popular, and the hardware will also evolve to optimize a little bit for that kind of architecture, and software frameworks will also evolve to optimize for that particular architecture. Right now the Transformer is the dominant architecture, so I'm not sure if
there is a good way out of this. Of course there's a lot of development. I think compilers will play a role, because compilers allow you to maybe still be much more efficient across different kinds of hardware: essentially you write the same code, and the compiler will be able to make it run efficiently on different kinds of hardware. For example, there's this language Mojo from Modular AI. They're compiler experts, and their bet is that AI models will be running on different kinds of devices, so let's make sure that we have really good compilers, with a good language, so that the compiler can do a good job optimizing for all kinds of devices. So that's maybe one way that you can get out of this cycle. But yeah, I'm not sure of a good way. In my own research I have to think about both the new algorithm or new model and how it maps to hardware. There are crazy ideas that seem really good but will be really, really difficult to run efficiently, and as a result, for example, we can't really scale some of the architectures up simply because they're not hardware friendly. So I have to think about both sides when I'm working on new models. Yeah. Have you spent any time looking at some of the new AI chip companies, so to speak, like the Cerebras of the world? One of their innovations is co-locating everything on the chip, so you remove some of this memory bandwidth issue. How do you think about that? Yeah, I think that's an interesting bet. I think Tesla also has this Dojo supercomputer, where they try to have essentially as fast on-chip memory as possible and remove some of this data transfer back and forth. I think that's a promising direction. The issues I could see, though I'm
definitely not a hardware expert: one issue is that on-chip memory tends to be really expensive to manufacture, much more expensive per gigabyte compared to off-chip memory. I talked to some of my friends at Cerebras, and they have their own stack and compiler and so on, and they can make it work. The other obstacle is, again, with the compiler and software framework and so on. For example, if you can run PyTorch on this stuff, lots of people will be using it, but supporting all the operations in PyTorch would take a long time to implement. Of course, people are working on this. So I think we kind of need these different bets on the hardware side as well. Hardware, in my understanding, has a longer time scale: you need to design the hardware, you need to manufacture it, maybe on the order of three to five years or something like that. So people are taking different bets, but the AI landscape is changing so fast that it's hard to predict what kind of models will be dominant in, say, three or five years. Or thinking back five years ago, would we have known that the Transformer would have been the dominant architecture? Maybe, maybe not. And so different people will make different bets on the hardware side. Yeah. Does the pace of the industry and the research also influence the PhD research itself? For example, in your case, you're working on improving attention. It probably took you quite a while to write the paper and everything, but in the meantime you could have had a new model architecture come out, and then nobody cares about attention anymore. How do people balance that? Yeah, I think it's tough. It's definitely tough for PhD students, for researchers, given that the field is moving really
really fast. I think it comes down to understanding fundamentals, because that's essentially what the PhD allows you to do: spend a couple of years understanding the fundamentals. For example, when I started my PhD I was working on understanding matrix-vector multiply, which is a concept that's been around for hundreds of years. We were trying to characterize what kind of matrices would have theoretically fast multiplication algorithms. That seems to have nothing to do with AI or anything, but I think that was a time when I developed mathematical maturity and research taste and research skill. The research topic at that point didn't have to be super trendy or anything, as long as I was developing skills as a researcher and making progress, and eventually I've gotten quite a bit better in terms of research skills. That allows, for example, students later in their career to quickly develop solutions to whatever problems they're facing. So I think that's just the natural arc of how you're being trained as a researcher. For a lot of PhD students, given that the pace is so fast, maybe it's harder to justify spending a lot of time on the fundamentals. It's tough. It's kind of an explore-exploit dilemma, and I don't think there's a universal answer. I personally spend some time doing this kind of exploration, reading random textbooks or lecture notes, and I spend some time keeping up with the latest architectures or methods and so on. I don't know if there's a right balance; it varies from person to person. But if you spend 100% on one, either only exploration or
only exploitation, I think it probably won't work in the long term. It's probably going to have to be a mix, and you have to just experiment and be introspective and see: hey, I tried this kind of mixture of, I don't know, one exploration paper and one exploitation paper, how did that work out for me? Having conversations with, for example, my advisor about: hey, did that work out, should I shift my focus more to one or the other? I think quickly adjusting and focusing on the process is probably the right way. I don't have a specific recommendation like: focus, I don't know, 60% on lecture notes and 40% on arXiv papers, or anything like that. Let's talk about some Transformer alternatives. Say Jonathan Frankle loses his bet and the Transformer is not the state-of-the-art architecture. What are some of the candidates to take over? Yeah, so this bet is quite fun. My understanding is this bet is between Jonathan Frankle and Sasha Rush. I've talked to Sasha a bunch, and I think he recently gave an excellent tutorial on Transformer alternatives as well, so I would recommend that. Just to quickly recap, I think there's been quite a bit of development more recently on Transformer alternatives, so architectures that are not Transformers. The question is, can they do well on, for example, language modeling, which is the application that a lot of people care about these days? There are methods based on state space methods that came out in 2021 from Albert Gu and Karan and Chris Ré, which presumably could do much better in terms of capturing long-range information while not scaling quadratically. They scale subquadratically in terms of sequence length, so potentially you could have a much
more efficient architecture when the sequence length gets really long. The other ones have been focusing more on recurrent neural nets, which is again an old idea, but adapted to the new landscape, so things like RWKV. I've also personally worked in this space as well. There have been some promising results here and there that show that these alternatives, either RNNs or state space methods, can match the performance of Transformers on language modeling. So that's really exciting. On the academic research side, we want to understand: do we really need attention? I think that's a valuable intellectual thing to understand. Maybe we do, maybe we don't, but if we want to know, we need to spend serious effort on trying the alternatives. There have been folks pushing in this direction. I think RWKV has scaled up to a model at 14 billion parameters that seems pretty competitive with Transformers, so that's really exciting. So that's kind of an intellectual thing: we want to figure out if attention is necessary. That's one motivation. The other motivation is that I think Transformer alternatives could have an advantage in practice in some use cases. One use case is really long sequences; the other is really high-throughput generation. For really long sequences, when you train a Transformer, even with FlashAttention and so on, the computation is still quadratic in the sequence length. So if your sequence length is on the order of, I don't know, 16K, 32K, or 100K, which some of these models have, then you do get significantly slower in terms of training and also in terms of inference. So maybe these alternative architectures could scale better in terms of sequence length. I haven't seen actual validation on
this. Let's say an RNN model released with a context length of, I don't know, 100K or something: I haven't really seen that. But the promise, or the hope, could be that as we scale to long sequences these alternative architectures could be better suited, not just for text but for things like high-resolution images, audio, video, and so on, which are emerging applications. So that's one: long sequences. Number two is high-throughput generation, where I can imagine scenarios where the application isn't an interactive chatbot, but, let's say, a company wants to batch as many requests as possible on their server, or they're doing offline processing, generating stuff based on their internal documents that they need to process in batch. The issue with Transformers is that during generation they essentially need to keep around all the previous history, the KV cache, and that could take a significant amount of memory, so you can't really batch too much because you run out of memory. I am personally bullish on RNNs. They essentially summarize the past into a state vector that has a fixed size, so the size doesn't grow with the history. That means you don't need as much memory to keep around all the previous tokens, and as a result I think you can scale to much higher batch sizes, make much more efficient use of the GPUs or the accelerator, and have much higher generation throughput. Now, I don't think this has been validated at scale. So as a researcher I'm bullish on this stuff, because I think in the next couple of years these are use cases where these alternatives could have an advantage. We'll just have to wait and see if these things will happen. I am personally bullish on this stuff. At the same time I also spend a bunch
of time making attention as fast as possible, so maybe I'm hedging; I'm playing both sides. Ultimately, as researchers, we want to understand what works and why the models have these capabilities. One way is to push attention to be as efficient as possible. On the other hand, let's push other alternatives to be as efficient as possible and scale them as big as we can, so that we can compare them and understand. Yeah, awesome. And I think as long as all of this work happens in the open, it's a net positive for everybody to explore all the paths. Let's talk about open source AI. Obviously at Together, when RedPajama came out, which was an open clone of the Llama 1 pre-training dataset, it was a big thing in the industry. Llama 2 came out on Tuesday, I forget, and this week there's been a lot of things going on, which they call open source but it's not really open source. I actually wrote a post about it that was on the front page of Hacker News before this podcast, so I was frantically responding. How do you think about what open source AI really is? In my mind, in open source software we have different levels of open. There's free software, that's like the GPL license; there's open source, which is Apache, MIT; and then there's kind of restricted open source, which is the SSPL and some of these other licenses. In AI you have the open models: RedPajama is an open model because you have the pre-training dataset, you have the training runs and everything, and then there's obviously randomness that doesn't make it one-to-one if you retrain it. Then you have the open-weights models, kind of like StableLM, where the weights are open but the dataset is not open. And then you have Llama 2, where the dataset is not open and the weights are restricted.
It's kind of like not really open source, but open enough. I think it's net positive, because it's like three million dollars of FLOPs donated to the public. How do you think about that? And also, as you work with Together, what is your philosophy on open source AI? Right, yeah, I think that's a great question. I think about it in maybe more practical terms. Of course, Meta has done an amazing job training Llama 1 and Llama 2, and for Llama 2 they made it much less restrictive compared to Llama 1: now you can use it for business, unless you have over 700 million monthly active users or something like that. I think just this change will have a very significant impact on the landscape of open source AI, where now lots of businesses, lots of companies, I expect, will be using things like Llama 2. They will fine-tune on their own datasets; they will be serving variants or derivatives of Llama 2. Whereas before, with Llama 1, it was a really good model, but businesses weren't allowed to do that. So I think in more practical terms it's shifting the balance between closed-source models like OpenAI and Anthropic and Google, where you're making API calls and maybe you don't understand as much of what the model is doing, how the model is changing, and so on, versus now we have a model with open weights that is pretty competitive, from what I've seen in terms of benchmarks, with GPT-3.5. And if you fine-tune it on your own data, maybe it's better suited for your own data. I do see that's going to shift the balance: more and more folks are going to be using, let's say, derivatives of Llama 2.
More and more folks are going to fine-tune and serve their own model instead of calling an API, so I think that shifting of balance is importa
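
A back-of-the-envelope sketch of the point made in the conversation that attention compute stays quadratic in sequence length even with FlashAttention: the two large matrix multiplies in exact attention (the query-key scores and the weighted sum over values) each cost roughly 2 * n^2 * d floating-point operations per head. The function name, head count, and head dimension below are illustrative assumptions, not figures from the episode.

```python
def attention_matmul_flops(seq_len: int, head_dim: int = 128, num_heads: int = 32) -> int:
    """Approximate FLOPs for the two matmuls of exact attention in one layer.

    Q @ K^T costs about 2 * n * n * d and P @ V costs about 2 * n * n * d
    per head, so the total is about 4 * n^2 * d * num_heads.
    Softmax, masking, and the linear projections are ignored here.
    """
    return 4 * seq_len * seq_len * head_dim * num_heads


if __name__ == "__main__":
    base = attention_matmul_flops(16_384)
    for n in (16_384, 32_768, 131_072):
        ratio = attention_matmul_flops(n) / base
        print(f"seq_len={n:>7}: {ratio:.0f}x the attention FLOPs of a 16K sequence")
```

Doubling the sequence length quadruples this term, which is why subquadratic alternatives are attractive at long context.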

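The memory argument for high-throughput generation can be made concrete with some rough arithmetic. This is a sketch under stated assumptions: the model shape (32 layers, 32 heads, head dimension 128, 16-bit values) and the 40 GiB budget are hypothetical, and the recurrent state is assumed to be a single hidden-size vector per layer, which is likely smaller than the state that real RNN or state space models keep.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len, batch=1, n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    """Memory for cached keys and values; grows linearly with seq_len."""
    # The leading 2 accounts for storing both K and V.
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_value


def recurrent_state_bytes(batch=1, n_layers=32, state_dim=4096, bytes_per_value=2):
    """Memory for a fixed-size recurrent state; independent of history length."""
    return batch * n_layers * state_dim * bytes_per_value


def max_batch(budget_bytes, bytes_per_sequence):
    """How many sequences fit in the given memory budget."""
    return budget_bytes // bytes_per_sequence


if __name__ == "__main__":
    budget = 40 * GIB  # hypothetical memory left over after model weights
    for n in (4096, 32_768):
        per_seq = kv_cache_bytes(n)
        print(f"KV cache at {n} tokens: {per_seq / GIB:.0f} GiB per sequence, "
              f"max batch {max_batch(budget, per_seq)}")
    state = recurrent_state_bytes()
    print(f"Fixed state: {state / 1024:.0f} KiB per sequence, "
          f"max batch {max_batch(budget, state)}")
```

Under these assumptions the cache for one sequence grows with every generated token while the fixed state does not, so the batch size that fits in memory shrinks as context grows only in the KV-cache case.
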
Key Takeaways

Apply kernel fusion and tiling to optimize attention computation
Use the NVIDIA CUTLASS library and the CUDA runtime to optimize matrix multiplication and memory loading on GPU
Implement FlashAttention-2 to make Transformers 800% faster and exact
Evaluate and measure model performance in emerging use cases like chatbots
Fine-tune language models with FlashAttention-2

💡 FlashAttention-2 makes Transformers 800% faster and exact by applying kernel fusion and tiling to optimize attention computation, and can be used to fine-tune language models and evaluate and measure model performance in emerging use cases like chatbots.
