Lightning Talk: KernelBot: The World's First Competitive GPU Programming Platform - Mark Saroufim

PyTorch · Beginner ·🧠 Large Language Models ·8mo ago


Key Takeaways

Introduces KernelBot, a competitive GPU programming platform

Full Transcript

Hey folks. If you were around for the earlier talk, this is very much in the same theme. I want to talk today about KernelBot, which is a competitive platform for writing fast GPU code. This was very much an open source collaboration that I worked on with a whole bunch of people: Alex, who was in the room; Matej, who's here (I won't make you wave your hand again); Eric; Elaine; and myself. Okay, see, they all waved. Matej, it's not a big deal. You can wave.

The core motivation behind building a competitive GPU programming platform was that we noticed there was hardly any interesting kernel data on GitHub. The data that is interesting is either under a license where you can't train a model with it or, if it does have a permissive license, it's slop. We had some early work, for example, with a project called KernelLLM, where we used compiler-generated data to train an LLM. This works really well, but the data looks like it was compiler generated. It doesn't look like a human actually wrote it. So, again, how can we flood the internet with more high-quality GPU kernels and get LLMs to be a bit better at solving this problem?

The benefit we had with GPU MODE is that it's a fairly active Discord community with over 20,000 people at this point. People nerd out on all sorts of topics related to GPU programming, from weekly lectures to working groups. What we wanted to do was convince more of these smart people on the server to donate data to us for free. And the right way to do this is that we give them money through prize pools and they give us their expertise back.
So it's a nice, even exchange.

The problem now is: okay, we have 20,000 people. How do we give them all a GPU? That's very expensive. Even though H100s are cheap, at $1.90 an hour, we can't give 20,000 GPUs to people. So we wanted something time shared: we have one pool of GPUs, and whenever you submit a job, you get access to a GPU, you run your kernel, and you get results back. The problem with most traditional setups for this is that it takes a long time to queue a job, on the order of two to three minutes for most serverless startups. Our solution was a workflow that takes closer to 15 seconds or so, where you don't even have access to the hardware but you submit your code. In this case you attach a file directly within Discord, and you get a result back within 15 seconds. We wanted it to be fast and interactive, in the same way that good video games are developed.

Like I said, we had this existing community that we could use, and it was fun. For instance, there was an earlier question from Georgie on thermal throttling. Well, what one person figured out was that they could submit a kernel at the right time of day, or depending on load, and use that to get a slightly different result. So this is great. We love the creativity.

And although Discord was quite nice in the beginning because we had this large community, it is a bit clunky: every time you want to submit a new kernel, you have to attach another file and refill a bunch of arguments, like which leaderboard this was for and which GPU. So what Matej built is something really neat.
It's basically a CLI where you can just say popcorn CLI submit with your file, and then you've submitted a kernel. So in the top half of your monitor you'll have VS Code, and in the bottom half you'll have your terminal. We use Rust. It doesn't matter, because I don't think Rust is the bottleneck here, but Rust is cool. The main problem is that there's less community interaction here, but it is eventually what most of our advanced users ended up picking up.

We also have a very nice website that Elaine, here in the audience, built for us. It's very similar: you can come, you can see the leaderboard, you can see who has the very top entry. And similarly, you can click on a specific problem and then submit directly within the website. Over the long term, this is what we expect to be the primary experience for people on the server. All of these make it nice: people can just use whatever they want, and we're quite fortunate that a lot of our smartest hackers make writeups describing how good their solutions are.

We have an architecture diagram, but it's a bit boring. Let's skip it.

The quick things we learned here: designing hard, interesting problems is tricky. You have to pick meaningful problems and shapes as well. You can't pick very small ones, because then you just measure overhead. We also noticed that sometimes the data distribution could be gamed. For example, if you wanted to produce the world's fastest vector mean kernel and you're using random inputs with mean zero, well, the result is almost always zero. So returning zero is the world's fastest vector mean kernel. That's bad. We want our problem distributions to avoid these kinds of problems, so we have a whole bunch of tests.
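To make the vector mean example concrete, here is a minimal Python sketch (not KernelBot's actual test suite). The function names, the tolerance of 0.05, and the input size are assumptions chosen for illustration. It shows a "kernel" that ignores its input and returns zero passing a tolerance check on mean-zero random inputs, and failing once the inputs are drawn with a nonzero mean.

```python
import random
import statistics


def reference_mean(xs):
    # Stand-in for the trusted reference implementation.
    return statistics.fmean(xs)


def cheating_mean(xs):
    # A "kernel" that never reads its input and always answers zero.
    return 0.0


def passes(kernel, xs, atol=0.05):
    # Hypothetical correctness check: compare against the reference
    # within an absolute tolerance (the value 0.05 is an assumption).
    return abs(kernel(xs) - reference_mean(xs)) <= atol


rng = random.Random(0)
n = 100_000

# Inputs drawn with mean zero: the true mean is very close to zero.
zero_mean = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Inputs drawn with mean three: the true mean is far from zero.
shifted = [rng.gauss(3.0, 1.0) for _ in range(n)]

zero_mean_passes = passes(cheating_mean, zero_mean)
shifted_passes = passes(cheating_mean, shifted)

print("cheater passes on mean-zero inputs:", zero_mean_passes)
print("cheater passes on shifted inputs:  ", shifted_passes)
```

Under these assumptions, changing only the input distribution is enough to expose the shortcut, which is the kind of property a problem's test inputs need to have.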
Of course, the trade-off is that running more tests increases the run time and hurts the interactivity a bit.

The other thing was security. We're basically distributing very valuable assets to the rest of the world, and people can submit Python files, and Python is an interesting language. For instance, within a Python file you can pip install stuff, and it works. You can start an SSH server; you can start a server, and that also works. So we had to do a lot of work to avoid hacky submissions. One source of hacky submissions is also people figuring out how to game our eval suite. But this is the kind of thing we like to see, because then we patch the eval. And so far most of our community has been very nice about escalating these tricky issues to us.

There were some issues, though. One of the earliest design decisions I made turned out to be one of the most questionable ones, which is that we use GitHub Actions as a scheduling mechanism. Basically, hardware vendors give us compute, we hook it up as a GitHub runner, and we have the artifacts stored on the GitHub Action. So we're basically using it as a replacement for Kubernetes or Slurm. It's okay, it works, but it's not my proudest design choice.

But the reason we need this, and can't rely purely on serverless, is that the vast majority of neoclouds don't enable NCU (Nsight Compute) access. So you can't get access to a real profiler, although there is a flag and you can enable it. It's not on by default because, allegedly, NVIDIA's security model doesn't make this easy. But like I said, I think this should be a default.
If anyone here works at a neocloud and you'd like me to tweet about you, please enable NCU access and I'll give you a free advertisement. And I mean it.

The other thing was cold start time. I talked about how we wanted things to be interactive. Just queuing a simple job on GitHub Actions takes about a minute and a half, because you need to spin up a container, install PyTorch, install PyTorch with CUDA, and run the job. This takes forever, and our budget was closer to 15 seconds. A lot of this is solved by using Modal, where we tend to see sub-10-second times.

The other thing is for people using native code. It turns out that PyTorch's load_inline will, by default, read about 7,000 C++ files before it can actually bind a kernel. This is very bad, and it doesn't matter how fast your machine is: that has high overhead. So we just don't do that, and now the compile times go from 90 seconds to 5 seconds. The other thing we're quite excited about has been leveraging NVRTC a bit more. NVRTC is a very fast, slightly more restrictive compiler library that you can use, and it makes compile times here go to about 0.1 seconds. And 0.1 seconds is absolutely fantastic. More people should know about and use NVRTC.

So how do we convince people to participate? A big part of it is that we want to partner with companies to optimize their problems. We've had two different $100K kernel competitions that we've worked on with AMD. These ranged from first doing DeepSeek-inspired kernels to now doing more comms kernels.
And we're also really thrilled: we're going to be launching another competition with NVIDIA, starting roughly this Friday, I believe, if everything goes well. That one is going to be more focused on GEMM and low-bit kernels. So if you're interested in technologies like CuTe DSL and cuTile and a lot of that fun stuff, you should check that out as well.

The main way we design a competition is that we assume a PyTorch reference, as in PyTorch determines what is correct. We pick a bunch of relevant input shapes by looking at real models. We have some sort of ranking formula: is it the mean or the geomean, and which problems might have different scores. You give some prize money, and great: this is much cheaper than hiring a grumpy CUDA hacker for a million dollars a year. So I think more companies should do this.

We've done a bunch of smaller competitions. For example, Alex, who isn't here, designed this kind of nasty problem called try, where you have two matmuls with a nonlinearity and then another matmul. A lot of compilers fall flat on their face here. But David Bar, who's now unfortunately at Anthropic, figured out a very clever solution for this by changing memory layouts in a Triton kernel.

So yeah, I guess I was trying to vague post here, but like I said, we have another NVIDIA competition coming soon. I hope you'll enjoy that, and I think we'll make it very interesting. The outcome now is that we have a community of people interested in GPU programming who want well-scoped problems.
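The choice between mean and geomean mentioned above can change who wins. The sketch below uses made-up runtimes for two hypothetical submissions across three input shapes; the names and numbers are illustrative assumptions, not leaderboard data or KernelBot's actual ranking formula.

```python
import math


def arithmetic_mean(times):
    return sum(times) / len(times)


def geometric_mean(times):
    # Average the logs, then exponentiate; requires all times > 0.
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Hypothetical runtimes in microseconds for small, medium, large shapes.
submissions = {
    "alice": [10.0, 100.0, 1000.0],
    "bob": [5.0, 50.0, 1090.0],
}


def rank(score):
    # Lower aggregate runtime ranks first.
    return sorted(submissions, key=lambda name: score(submissions[name]))


by_mean = rank(arithmetic_mean)
by_geomean = rank(geometric_mean)

print("ranked by mean:   ", by_mean)     # alice first (370.0 vs about 381.7)
print("ranked by geomean:", by_geomean)  # bob first (about 64.8 vs 100.0)
```

In this example the arithmetic mean is dominated by the largest shape, so a small regression there outweighs a 2x speedup on the smaller shapes, while the geomean weighs the relative change on each shape equally.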
It went fairly viral. At this point we've aggregated over 60,000 high-quality kernels from about 600 users, and by default we make it so that all submitted kernels are open source. So just by submitting it, it's not ours, it's yours. You can train a model, you can study the kernels, you can do whatever you want. Life is good, right?

If you're interested, getting started is as simple as this. You can install the popcorn CLI with this (sorry for installing a shell command from the internet, but this is what we do these days). Then you register and authenticate via either GitHub or Discord. When you want to make a submission, you just pick a problem and pick a GPU. You should try doing well on the grayscale problem; that's one of our most popular beginner problems. If you're interested in learning more, just check out gpumode.com. You'll see all the instructions there, and I'd love to take questions. Thank you. [Applause]

I'm surprised Georgie doesn't have a hard question for me.

>> Okay. You want to solve them? Okay. Sorry. So could you talk a bit more about the AMD back-end platform and how you managed that? Because for NVIDIA there's Modal, and you can probably do much faster instantiation with that, but with AMD, in my experience with the competition, sometimes it took 5 to 10 minutes just to get a GPU. So how did you manage that? Was there a cluster that you got, or something like that?

>> Yeah, it's a great question. You're right, Modal doesn't support AMD. So the way we made this work is with GitHub Actions. We basically tell cloud vendors with bare metal access, please run this script. It'll connect the infrastructure to us, and then we can queue GitHub Actions jobs to it, and things just work.
It is a bit clunky, because you have to maintain the machines, they might hang, some dependency issue might happen. But barring a good serverless solution, I think we've settled on a fairly good design.

>> Like on demand, or do you have some clusters already reserved?

>> It's okay.

>> Yeah, so I'll repeat the question. The question is basically: what's our collaboration like with neoclouds? Do we rent it? Do they give us the compute? Generally, I would love it if we could just host our own compute, but it's a bit tricky to host some of this bigger compute in something like an apartment. You do actually need a professional to house it. And for neoclouds, I think a lot of them want to engage more with GPU hackers. So as long as we give them good enough publicity and they get enough users on their platform, they effectively just comp the compute. For context, all of the hardware costs for GPU MODE I've paid for out of pocket, and it's cost me around $80 or so to manage the entire service over the last year. So I'm pretty proud of that. And a lot of the work from people here as

Original Description

Lightning Talk: KernelBot: The World's First Competitive GPU Programming Platform - Mark Saroufim, Meta

KernelBot is a competitive platform that showcases the power of community-driven innovation, hosted on GPU MODE's vibrant ecosystem of 17K developers. Our platform democratizes state-of-the-art kernel authoring by bringing together GPU programmers from diverse backgrounds to collaboratively push the boundaries of performance optimization. What began as an effort to aggregate high-quality kernel tokens for LLM code generation has evolved into a thriving community movement with over 25K submissions and $100K in prize pools, where community contributors are now outperforming optimized commercial baselines.

The success of this vendor-neutral platform stems from our commitment to making cutting-edge kernel development accessible to everyone. Community members leverage our fast PyTorch cold starts to iterate rapidly while we maintain minimal infrastructure barriers. By standardizing on PyTorch tensors as inputs, we've created a common foundation that allows the fastest libraries to shine regardless of their origin. Our community's achievements extend beyond individual competitions: we're open-sourcing the top-performing kernels so the entire ecosystem can learn from these innovations.


