Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner
Key Takeaways
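The talk introduces MAX, Modular's AI framework for running inference across CPU and GPU, as an option for teams that need to go beyond hosted inference endpoints: to control their data, integrate with their own security, customize models, save money, or explore different hardware. Lattner contrasts MAX with the fragmented ecosystem that grew up around PyTorch, TensorFlow, ONNX, and TensorRT, and cites performance claims including quantized CPU inference 5x faster than llama.cpp and GPU matrix multiplication that meets or beats cuBLAS and CUTLASS. MAX is built on Mojo, a new Pythonic programming language that combines what is great about the Python ecosystem with a high-performance compiler.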
Full Transcript
[Music] All right, good morning everyone. I'm here to talk to you about Modular and accelerating the pace of AI. You know what GenAI is, so I'm not going to tell you all about this. Let me tell you one of the things I think is really cool about it, and very different from certain other technologies: it's super easy to deploy. There are lots of great endpoints out there, a lot of good implementations, and a lot of ways to make it super easy to build a prototype and get going very quickly.
But despite the availability of all these different endpoints, sometimes you do have other needs. Sometimes you might want to control your data instead of sending your data to somebody else. Sometimes you might want to integrate it into your own security, because you've got your critical company data in your model and you don't want to fine-tune it somewhere else. Sometimes you want to customize the model: there's research happening all the time, and building proprietary models that work best for your use cases can make your applications even better. And of course the inference endpoints are expensive, so sometimes you want to save money. Sometimes there's hardware out there that's really interesting, and you want to explore out from the mainstream. If you care about any of these things, you need to go beyond the endpoint.
So how do you do that? Many of you have explored this, and I'm sure the answer has shifted. It used to be that we had things like PyTorch and TensorFlow and Caffe. But as inference became more important, the world shifted. First we got ONNX, TensorRT, and things like this, and today we have an explosion of different frameworks, some of which are specific to one model. That's cool if you care about that one model, but if you have many different things you want to deploy and work with, it's very frustrating to have to switch between all these different technologies.
And of course it's not just the model. You all know there's this gigantic array of different technologies that get used to build real-world things in production, and of course none of these are really designed for GenAI. So my concern about this, my objection to the status quo, is that this fragmentation slows down getting the research and the innovations coming into GenAI into your products. We've seen so many of these demos, and last year was really the year of the GenAI demo, but still we're struggling to get GenAI into products in an economical and good way.
So whose fault is it? Is it our fault? Many of you are AI engineers. If you're not, let's sympathize with the plight of the AI engineer, because y'all, these folks who are building this, have new models and optimizations coming out every week. Every product needs to be enhanced with GenAI. This is not like one thing; we're getting dumped on, and there's so much to do we can't even keep up. There's no time to deal with new hardware and all the other exciting new features. And of course, once you get something that actually works, the costs end up making it very difficult to scale these things, because getting things into production means suddenly you're paying on a per-unit basis. So it's not the AI engineer's fault. We should look at the concerns and the challenges faced here. I think we need a new approach. We've learned so much; let's look at what we need to do to solve this and improve the world here.
This is what Modular is about, so I'll give you a quick intro of what we're doing and our approach. First of all, who are we? Modular is a fairly young company; we've been around for a couple of years. We have brought together some of the world's experts that built all of these things: we've built TensorFlow and PyTorch, and we've built compilers like LLVM and MLIR and XLA and all of these different things. What I can say about that is that we learned a lot, and I apologize, because we know why it is so frustrating to use all these things. But really, the world looked very different five years ago. GenAI didn't exist. It's understandable. We tried really hard, but we've learned. So our goal is to make it so you can own your AI: you can own your data, you can control your product, you can deploy where you want to, and to make it much easier than the current systems work today.
So how? What we're doing is really going back to the basics. We're bringing together the best-in-class technologies into one stack, not one solution per model. Our goal is to lift Python developers and PyTorch users. This is where the entire industry is, so we want to work with existing people. We're not trying to say, "Hey, ditch everything you know and try something new." We want to gradually teach and give folks new tools so they can have superpowers. And finally, I spent a lot of time at Apple: I want things that just work. You want to build on top of infrastructure; you do not want to have to be an expert in the infrastructure. This is the way all of this stuff should work, and unfortunately it's just not the case today in AI.
So at Modular we're building a technology called MAX. I'll explain super fast what this is. MAX is two things. One is an AI framework, which I'll spend a bunch of time on. The AI framework is free and widely available, and it's what we'll talk about today. The other is our managed services. This is how Modular makes money, very traditional, and we're not going to spend a lot of time talking about that today. If you dive into this AI framework, we see it as two things: it's the best way to deploy PyTorch, and it's also the best way to do GenAI, and both halves of this are really important. MAX is currently very focused on inference. These are areas where PyTorch is challenging at times, and this is where GenAI is driving us crazy with cost and complexity, so really focusing on this problem is something that we are all about. The other thing, as I said before, is Python. We natively speak Python; that is where the entire world is. We also have other options, including C++, which we'll talk about later.
So how do we approach this? As I said, we work with PyTorch out of the box. You can bring your models, and your model works. We can talk to the wide array of PyTorch-y things like ONNX and TorchScript and torch.compile and all this stuff, so you can pick your path, and that's all good. If you want to go deeper, you can use native APIs. Native APIs are great if you speak the language of KV caches and paged attention and things like this, and you care about pushing the state of the art of LLMs and other GenAI techniques. That's very cool. Also, MAX is very different in that it really rebuilds a ton of the stack, which I don't have time to talk about. We do not build on top of cuDNN and the NVIDIA libraries, or on top of the Intel libraries. We replace all that with a single consistent stack, which is really [a different] approach, and I'll talk about what that means later. What you get is a whole bunch of technology that you don't have to worry about. Again, as a next-generation technology, you get a lot of fancy compiler technologies, runtimes, high-performance kernels, all this stuff in the box, and you don't have to worry about it, which is really the point.
Now, why would you use MAX? It's an AI framework, and you have one, right? There are lots of different reasons why people might want to use an alternative. For example, developer velocity: your team being more productive. That's actually incredibly important, particularly if you're pushing the state of the art, but it's also very hard to quantify. So I'll do the same thing that people generally do and talk about the quantifiable thing, which is performance. I'll give you one example of this. We just shipped a release that has our int4/int6 K-quant fancy quantization approach. This is actually 5x faster than llama.cpp, so if you're using llama.cpp today on cloud CPUs, this is actually a pretty big deal. 5x can have a pretty big impact on the actual perceived latency of your product, and on performance and cost characteristics. The way this is possible is, again, this combination of really crazy compiler technology and other stuff underneath the covers, but the fact that you don't have to care about that is actually pretty nice. It's also pretty nice that this isn't just one model. We have this make-it-easy-to-do-int4 technology, and then we demonstrate it with a model that people are very familiar with. So if you care about this kind of stuff, this is actually pretty interesting. It's a next-generation approach to a lot of things that are very familiar, but it's also done in a generalizable way.
Now, CPUs are cool, and so far we've been talking about CPUs, but GPUs are also cool. What I would say, and what I've seen, is that CPUs and AI are kind of well understood, but GPUs are where most of the pain is, so I'll talk just a little bit about our approach on this. Before I tell you what we're doing, let me tell you our dream, and this is not a small ambition; this is kind of a crazy dream. Imagine a world where you can program a GPU as easily as you can program a CPU, in Python. Okay, not C++, in Python. That is a very different thing than the world is today. Imagine a world in which you can actually get better utilization from the GPUs you're already paying for. I don't know your workload, but you're probably somewhere between 30% and maybe 50% utilization, which means you're paying for two to three times the amount of GPU that you should be. That is understandable given the technology today, but it's not great, for lots of obvious reasons. Imagine a world where you have the full power of CUDA, so you don't have to say there's a powerful thing and there's an easy-to-use thing: you can have one technology stack that scales well.
This is something that is really hard. NVIDIA has a lot of very good software people, and they've been working on this for 15 years. But I don't know about you, I don't run 15-year-old software on my cell phone. It doesn't run BlackBerry software either. I think it's time to really rethink this technology stack and push the world forward, and that's what we're trying to do.
So how does it work? Well, it's just like PyTorch: you use one line of code and switch out CPU for GPU. Ha ha, we've all seen this, right? This doesn't say anything. I actually hate this kind of demo, because the way this is usually implemented is by having a big fork at the top of two completely different technology stacks, one built on top of Intel MKL, one built on top of CUDA. As a consequence, nothing actually works the same except for the thing on the slide. What Modular has done here is we've gone down and said: let's replace that entire layer of technology. Let's replace the matrix multiplications, let's replace the fused attention layers, let's replace the graph thingies, let's replace all this kind of stuff, and then make it work super easily and super predictably, and make it all stitch together. And yeah, it looks fine on a slide, but the slide is missing the point.
So if you are an advanced developer, and many of you don't want to know about this, and that's cool, but if you are an advanced developer, like I said, you get the full power of CUDA. If you want, you can go write custom kernels directly against MAX, and that's great. For advanced developers, which I'm not going to dive too deeply into, it's way easier to use than things like the Triton language, and it has good developer tools and all the things you'd expect from a world-class implementation of GPU programming technology. For people who don't want to write kernels, you also get a very fancy auto-fusing compiler and things like this, so you get good performance for the normal cases without having to write hand-fused kernels, which is again a major usability improvement.
Now, it's cool: there are a lot of things out there that promise to be easy, but what about performance? A lot of the reason to use a GPU in the first place is performance. One of the things I think is pretty cool, and one of the things that's very important to Modular, is that we're not comparing against low standards. We're comparing against the vendor's best, in this case NVIDIA; they're experts in their architecture. Again, there's a million ways to measure things, but take a microbenchmark and look at the core operation within a neural network, matrix multiplication. This is the most important thing for a wide variety of workloads. Again, it's one set of data, but we compare against cuBLAS, the hard-coded thing, and also against CUTLASS, the more programmable C++-y thing. MAX is meeting and beating both of these by just a little bit. It depends on your bar, and data is complicated, but if you're winning by 30%, 30% is actually a pretty big deal given the amount of cost, the amount of complexity, and the amount of effort that goes into these kinds of things.
So I've talked a lot about the what, but I haven't talked about the how, and the how is actually a very important part of this. I'll just give you a sample. We are crazy enough that we decided to go rebuild the world's first AI stack from the bottom up for GenAI, and as part of doing that, we realized we had to go even deeper, so we built a programming language. We have a new programming language; it's called Mojo. The thing about Mojo is, if you don't want to know about Mojo, you don't have to use Mojo. You can just use MAX; it's fine. But we had to build Mojo in order to build MAX. I'll tell you just a couple of things about this. Our goal is that Mojo is the best way to extend Python, and that means you can get out of C, C++, and Rust. So what is it? As a programming language, it's Pythonic: it looks like Python, it feels like Python, everything you know about Python comes over, and you don't have to retrain everything, which is a really big deal. You get a full toolchain you can download on your computer, and you can use it in Visual Studio Code. It's open source, available on Linux, Mac, and Windows. 200,000 people, 20,000 people in Discord. It's really cool. I would love for you to go check it out if you're interested.
But what is Mojo? What actually is it? Fine, there's a programming language thing going on. Well, we decided that AI needs two things. It needs everything that's amazing about Python. In my opinion, that's the developers, the ecosystem, the libraries, the community, and even, sorry, the package managing and all the things that people are used to using already. Those are the things that are great about Python. But what is not great about Python, unfortunately, is its implementation. So what we've done is combine the things that are great about Python with some very fancy, highfalutin compiler stuff, MLIR, all this good stuff, which then allows us to build something really special. So while it looks like Python, please do forget everything you know about Python, because this is a different beast. I'm not going to give you a full hour-long presentation on Mojo, but I'll give you one example of why it's a different beast, pulling back to something many of you care about, which is performance.
What I'll say is that Mojo is fast. How fast? Well, it depends. This isn't a slightly faster Python; this is a working-back-from-the-speed-of-light-of-hardware kind of system. Many people out there have found that it's a hundred to a thousand times faster, and in crazy cases it can be even better than that. But the speed is not the point; the point is what it means. In Python, for example, you should never write a for loop, at least if you care about performance; Python is not designed for writing for loops. In Mojo you can go write code that does arbitrary things. This is an example pulled from our Llama 3 written in Mojo that does tokenization using a standard algorithm: chasing linked lists, with if statements and for loops. It's just normal code, and it's Python. I mean, it feels like Python, and that is really the point.
So for you, the benefit of Mojo is, first of all, you can ignore it if you don't want to care about it. But if you do, you don't have to learn C and C++. You have lower cost by default versus Python, because performance is cost. It means that as a researcher, if you use this, you can actually have full-stack hackability. And if you're a manager, it means you don't have to have people who know Rust and C++ and things like this on your team; you can have a much more coherent engineering structure where you're able to scale into the problem no matter where it is. And if you want to see something super polarizing, go check the Modular blog, where we explain how it's actually faster than Rust, which many people consider to be the gold standard, even though it's, again, a 15-year-old language.
So I have to wrap things up; they'll get mad at me if I go over. The thing that I'm here to say is that many of you may want to go beyond the API. The APIs are fantastic; there's amazing technology out there, and I'm very excited about them too. But if you care about control over your data, you want to integrate into your security, you want customization, you want to save money, you want portability across hardware, then you need to get on to something else. If you're into these things, then MAX can be very interesting to you. MAX is free; you can download it today. It's totally available, go nuts. We didn't talk about production or deployment or things like this, but if you want to do that, we can also help. We support production deployment on Kubernetes and SageMaker, and we can make it super easy for you. Our GPU support, like I said, is actually really hard. We're working really hard on this and we want to do it right, so it'll launch officially in September. If you join our Discord, you can get early access, and we'd be very happy to work with you ahead of that too. We're cranking out new stuff all the time, so if you are interested in learning more, you can check out modular.com, find us on GitHub (a lot of this is open source), and join our Discord. Thank you, everyone. [Music]
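The GPU utilization estimate given in the talk (30% to 50% utilization meaning you pay for two to three times the GPU you need) can be checked with simple arithmetic. The sketch below is not from the talk: the function name `overprovision_factor` is hypothetical, and it assumes cost scales linearly with provisioned GPU capacity.

```python
def overprovision_factor(utilization: float) -> float:
    """Return how many times more GPU capacity is paid for than is actually used.

    Assumption (not from the talk): cost is proportional to provisioned
    capacity, so the factor is simply 1 / utilization.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return 1.0 / utilization


if __name__ == "__main__":
    # The two utilization levels quoted in the talk.
    for u in (0.30, 0.50):
        factor = overprovision_factor(u)
        print(f"{u:.0%} utilization: paying for {factor:.1f}x the GPU actually used")
```

Under that assumption, 50% utilization gives a factor of 2.0 and 30% gives about 3.3, which is consistent with the rough "two to three times" figure.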
Original Description
Today's leading generative AI applications have workloads that span high performance GPU compute, CPU preprocessing, data-loading, and orchestration — often spread across a combination of Python, C++/Rust, and CUDA C++ — which increases the complexity and slows down the cycle of innovation. This talk explores the capabilities and power of the Modular Mojo programming language and Modular Accelerated Xecution (MAX) platform, which unifies CPU and GPU programming into a single Pythonic programming model that is simple and extensible. This results in reduced complexity and improved developer productivity, and streamlines innovation. We'll walk through CPU and GPU support with real-world examples, providing details of how AI application developers can use MAX and Mojo to define an end-to-end AI pipeline and overcome the complexities.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025
About Chris
Chris Lattner is a co-founder and the CEO of Modular, which is building an innovative new developer platform for AI and accelerated compute. Modular provides an AI engine that accelerates PyTorch and TensorFlow inference, as well as the Mojo🔥 language, which extends Python into systems and accelerator programming domains. He has also co-founded the LLVM Compiler infrastructure project, the Clang C++ compiler, the Swift programming language, the MLIR compiler infrastructure, the CIRCT project, and has contributed to many other commercial and open source projects at Apple, Tesla, Google and SiFive.