Lightning Talk: KernelBot: The World's First Competitive GPU Programming Platform - Mark Saroufim

PyTorch · Beginner ·🧠 Large Language Models ·8mo ago

Key Takeaways

Introduces KernelBot, a competitive GPU programming platform

Full Transcript

Hey folks, if you were around for the earlier talk, this is very much in the same theme. I want to talk today about KernelBot, which is a competitive platform for writing fast GPU code. This was very much an open source collaboration that I worked on with a whole bunch of people: Alex, who was in the room; Matej, who's here (I won't make you wave your hand again); Eric; Elaine; and myself. So, yes. Okay. See, they all waved. Matej, it's not a big deal. You can wave.

All right. So the core motivation behind building a competitive GPU programming platform was that we noticed there was hardly any interesting kernel data on GitHub. The data that's interesting is either under a license where you can't train a model with it, or, if it does have a permissive license, then it's slop. We had some early work, for example, with this project called KernelLLM, where we used compiler-generated data to train an LLM. This works really well, but then the data looks like it was compiler generated. It doesn't look like a human actually wrote it. So again, how can we flood the internet with more high-quality GPU kernels and get LLMs to be a bit better at solving this problem?

All right. So basically the benefit we had with GPU MODE is that GPU MODE is a fairly active Discord community with over 20,000 people at this point. People nerd out on all sorts of topics related to GPU programming, from weekly lectures to working groups. And so what we wanted to do was convince more of these smart people on the server to donate data to us for free. And the right way to do this is that we give them money with prize pools and they give us their expertise back.
So, it's a nice and even exchange. All right. So the problem now is, okay, we have 20,000 people. How do we give them all a GPU? That's very expensive. Even though H100s are cheap, at $1.90 an hour, we can't give 20,000 GPUs to people. So we wanted something that's time shared: basically we have one pool of GPUs, and whenever you submit a job, you get access to a GPU, you run your kernel, you get results back. The problem with most traditional setups for this is that it takes a long time to queue a job, on the order of two to three minutes for most serverless startups. And so our solution was to have a workflow that takes closer to 15 seconds or so, where you don't even have access to the hardware but you submit your code. In this case you would attach a file within Discord directly and you get a result back within 15 seconds. So we wanted it to be fast and interactive, in the same way that good video games are developed, right?

So like I said, we had this existing community that we could use. It was fun. For instance, there was an earlier question from Georgie on thermal throttling. Well, what one person figured out was that they could basically submit a kernel at the right time of day, or depending on load, and use that to get a slightly different timing for their kernel. So this is great. We love the creativity. And although Discord was quite nice in the beginning because we had this large community, it is a bit clunky, because every time you want to submit a new kernel you have to attach another file and refill a bunch of arguments, like which leaderboard was this, which GPU was this. So this was all quite clunky. So what Matej built is something really neat.
It's basically a CLI where you can just say popcorn CLI submit with your file, and then you can just submit a kernel. So in the top half of your monitor you'll have VS Code and in the bottom half you'll have your terminal. We use Rust. It doesn't matter, because I don't think Rust is the bottleneck here, but Rust is cool. The main problem is that there's less community interaction here, but it is eventually what most of our advanced users ended up picking up.

We also have a very nice website that Elaine, here in the audience, built for us. So it's very similar. You can come, you can see the leaderboard, you can see who has the very top entry. And similarly, you can click on a specific problem and then submit stuff directly within the website. So this is, over the long term, what we expect to be the primary experience for people on the server. So all of these make it nice: people can just use whatever they want, and we're quite fortunate that a lot of our smartest hackers would make writeups describing how good their solutions are. So we have an architecture diagram, but, you know, it's a bit boring. Let's skip it.

All right. So the quick things we learned here: designing hard, interesting problems is tricky. You want a problem that's interesting. You have to pick meaningful problems and shapes as well. You can't pick very small ones, because then you just measure overhead. We also noticed that sometimes the data distribution could be gamed. For example, if you wanted to produce the world's fastest vector mean kernel and you're doing random inputs with mean zero, well, the result is almost always zero. And so a kernel that just returns zero is the world's fastest vector mean kernel. That's bad. So we basically want our problem distributions to avoid these kinds of problems. So we have a whole bunch of tests.
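The mean-zero pitfall described above can be illustrated with a small sketch. This is not KernelBot's actual eval code: the function names, the tolerance, and the input sizes are all assumptions chosen for illustration. It shows a "kernel" that ignores its input passing a tolerance check on mean-zero random data, and failing once the input distribution is shifted.

```python
import random
import statistics

def reference_mean(xs):
    # Stand-in for the trusted reference implementation.
    return statistics.fmean(xs)

def cheating_mean(xs):
    # Hypothetical gamed submission: never reads the input.
    return 0.0

def passes(candidate, xs, atol=1e-2):
    # Hypothetical correctness check: absolute tolerance against the reference.
    return abs(candidate(xs) - reference_mean(xs)) <= atol

rng = random.Random(0)
n = 400_000

# Mean-zero inputs: the true mean is so close to 0 that returning 0 "works".
zero_mean_inputs = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Shifted inputs: the true mean is near 3, so the constant answer is exposed.
shifted_inputs = [rng.gauss(3.0, 1.0) for _ in range(n)]

zero_mean_ok = passes(cheating_mean, zero_mean_inputs)
shifted_ok = passes(cheating_mean, shifted_inputs)

print("passes on mean-zero inputs:", zero_mean_ok)
print("passes on shifted inputs:", shifted_ok)
```

Testing against more than one input distribution is one way to catch this kind of submission, at the cost of a longer evaluation.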
You know, of course, the trade-off here is that this increases the time and it wrecks the interactivity a bit.

The other thing was security. We're basically distributing very valuable assets to the rest of the world, and people can submit Python files, and Python is an interesting language. For instance, within a Python file you can pip install stuff and it works. You can start an SSH server, you can start a server, and that also works. So we had to do a lot of work to avoid hacky submissions. One source of hacky submissions is also people figuring out how to game our eval suite. But this is the kind of thing we like to see, because then we patch the eval. And for now most of our community has been very nice about escalating these tricky issues to us.

All right. There were some issues though. For instance, one of the earliest design decisions I made turned out to be one of the most questionable ones, which is that we use GitHub Actions as a scheduling mechanism. So basically hardware vendors will give us compute, we'll hook it up as a GitHub runner, and we have the artifacts stored on the GitHub Action. So we're basically using it as a replacement for Kubernetes or Slurm. It's okay, it works, but it's not my proudest design choice.

But the reason why we need to use this and can't purely rely on serverless stuff is that the vast majority of neoclouds don't enable NCU access. So you can't get access to a real profiler, although you can: there's a flag and you can enable it. It's not on by default because, allegedly, Nvidia's security model doesn't make this easy. But like I said, I think this should be a default.
If anyone here works at a neocloud and you'd like me to tweet about you, please enable NCU access and I'll give you a free advertisement. And I mean it.

All right. The other thing was this cold start time, right? I talked about how we wanted things to be interactive. And just queuing a simple job on GitHub Actions takes about a minute and a half, because you need to spin up a container, install PyTorch, install PyTorch with CUDA, run the job. This takes forever. And our budget was closer to 15 seconds. A lot of this is solved by using Modal, where we tend to see these sub-10-second times.

The other thing is for people that are using native code. Well, it turns out that if you're using load_inline, PyTorch's load_inline by default will read about 7,000 C++ files before it can actually bind a kernel. This is very bad, and the overhead here doesn't depend on how fast your machine is; it's just high overhead. So we just don't do that, and now the compile times go from 90 seconds to 5 seconds. The other thing we're quite excited about has been leveraging NVRTC a bit more. NVRTC is a very fast, slightly more restrictive compiler library that you can use, and it makes compile times here go to 0.1 seconds. So 0.1 seconds, it's absolutely fantastic, and more people should know about and use NVRTC.

So yeah, how do we convince people to participate, though? Well, a big part of it is that we want to partner with companies to optimize their problems. So we've had two different $100K kernel competitions that we've worked on with AMD. This ranged from first doing DeepSeek-inspired kernels to now doing more comms kernels.
And we're also really thrilled: we're going to be launching another competition with Nvidia, starting roughly this Friday, I believe, if everything goes well. That's going to be a little bit more focused on GEMM kernels. So if you're interested in technologies like CuTe DSL and cuTile and a lot of that fun stuff, you should check that out as well.

All right. So the main way we design a competition is that we basically assume a PyTorch reference, as in PyTorch determines what is correct. We pick a bunch of relevant input shapes by looking at real models. We have some sort of ranking formula: is it the mean or the geomean, and which problems might have different scores. You give some prize money, and great. And this is much cheaper than hiring a grumpy CUDA hacker for a million dollars a year. So I think more companies should do this.

Great. So we've done a bunch of smaller competitions. For example, Alex, who isn't here, designed this kind of nasty problem called try, where you have two matmuls with a nonlinearity and then another matmul. A lot of compilers fall flat on their face here. But David Bar, who's now unfortunately at Anthropic, also figured out a very clever solution for this by changing memory layouts in a Triton kernel. So yeah, I guess I was trying to vaguepost here, but like I said, we have another Nvidia competition coming soon. So I hope you'll enjoy that, and I think we'll make it very interesting.

So yeah, the outcome now is that we have a community of people that are interested in GPU programming and want well-scoped problems.
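The ranking-formula question raised in the competition design steps (mean or geomean across input shapes) can be made concrete with a small sketch. The runtimes below are made-up numbers, and this is not necessarily the formula KernelBot uses; it only shows that the two aggregates can rank the same pair of submissions differently when one input shape dominates the runtime.

```python
import math

def arithmetic_mean(times):
    # Dominated by the slowest (largest) shape.
    return sum(times) / len(times)

def geometric_mean(times):
    # Rewards relative speedups equally on every shape.
    return math.exp(sum(math.log(t) for t in times) / len(times))

# Hypothetical runtimes in microseconds on three input shapes (lower is better).
submission_a = [10.0, 10.0, 1000.0]  # fast on small shapes, slower on the big one
submission_b = [20.0, 20.0, 900.0]   # slower on small shapes, faster on the big one

mean_a = arithmetic_mean(submission_a)
mean_b = arithmetic_mean(submission_b)
geo_a = geometric_mean(submission_a)
geo_b = geometric_mean(submission_b)

print(f"arithmetic mean: A={mean_a:.1f} B={mean_b:.1f}")
print(f"geometric mean:  A={geo_a:.1f} B={geo_b:.1f}")
```

With these numbers, B ranks first under the arithmetic mean and A ranks first under the geometric mean, which is why the choice of formula has to be fixed before a competition opens.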
It went fairly viral. At this point we've aggregated over 60,000 high-quality kernels from about 600 users, and by default we make it so that all the kernels that are submitted are open source. So basically, just by submitting it, you know, it's not ours, it's yours. You can train a model, you can study the kernels, you can do whatever you want. Life is good, right?

So yeah, if you're interested, getting started is as simple as this. You can install the popcorn CLI with this (sorry for installing a shell command from the internet, but this is what we do these days). And then you can register and authenticate via either GitHub or Discord. And then when you want to make a submission, you just pick a problem, you pick a GPU, and you should try doing well on the grayscale problem. That's one of our most popular beginner problems. So yeah, if you're interested in learning more, just check out gpumode.com. You'll see all the instructions for it, and I'd love to take questions. Thank you. [Applause]

I'm surprised Georgie doesn't have a hard question for me. So,

>> Okay. You want to solve them? Okay. Sorry. So could you talk a bit more about the AMD back-end platform and how you managed that? Because for Nvidia there's Modal, and you can probably do much faster instantiation with that, but with AMD, in my experience with the competition, sometimes it took 5 to 10 minutes just to get a GPU.

>> So how did you manage that? Was there a cluster that you got, or something like that?

Yeah, it's a great question. You're right. Modal doesn't support AMD. So the way we made this work is, this is why GitHub Actions work. We basically tell cloud vendors with bare metal access, please run the script. It'll connect the infrastructure to us, and then we can basically queue GitHub Actions jobs to it and things just work.
It is a bit clunky, because you have to maintain the machines, they might hang, some dependency issue might happen. But barring a good serverless solution, I think we've settled on a fairly good design.

>> Like on demand, or do you have some clusters already reserved?

>> It's okay.

>> Yeah, so I'll repeat the question. So the question is basically, what's our collaboration like with neoclouds? Do we rent it? Do they give us the compute? Yeah. So generally I would love it if we could just host our own compute, but it's a bit tricky to host some of this bigger compute in something like an apartment. So you do actually need a professional to house it. And for neoclouds, I think a lot of them want to engage more with GPU hackers. And so as long as we give them good enough publicity and they get enough users to their platform, they effectively just comp us the compute. And so for context, all of the hardware costs for GPU MODE I've paid for out of pocket, and it's cost me around $80 or so to manage the entire service over the last year. So I'm pretty proud of that. And a lot of the work from people here as

Original Description

Lightning Talk: KernelBot: The World's First Competitive GPU Programming Platform - Mark Saroufim, Meta

KernelBot is a competitive platform that showcases the power of community-driven innovation, hosted on GPU MODE's vibrant ecosystem of 17K developers. Our platform democratizes state-of-the-art kernel authoring by bringing together GPU programmers from diverse backgrounds to collaboratively push the boundaries of performance optimization. What began as an effort to aggregate high-quality kernel tokens for LLM code generation has evolved into a thriving community movement with over 25K submissions and $100K in prize pools, where community contributors are now outperforming optimized commercial baselines.

The success of this vendor-neutral platform stems from our commitment to making cutting-edge kernel development accessible to everyone. Community members leverage our fast PyTorch cold starts to iterate rapidly while we maintain minimal infrastructure barriers. By standardizing on PyTorch tensors as inputs, we've created a common foundation that allows the fastest libraries to shine regardless of their origin. Our community's achievements extend beyond individual competitions: we're open-sourcing the top-performing kernels so the entire ecosystem can learn from these innovations.