Parameterized CUDA Graph Launch in PyTorch: CUDA Graphs Without the Pain - Daniel Galvez, NVIDIA
Parameterized CUDA Graph Launch in PyTorch: CUDA Graphs Without the Pain - Daniel Galvez, NVIDIA
Modern GPUs are fast enough that CPU kernel launch overhead has become a real bottleneck. CUDA Graphs can eliminate this overhead, but in practice they are hard to use and easy to get wrong.
When CUDA Graph capture fails, PyTorch users typically face two choices: fix the code that breaks capture—often with limited guidance—or capture only parts of the workload. Partial capture comes with sharp footguns, most notably large increases in device memory usage due to CUDA Graphs’ private memory pools.
This talk walks through the most common CUDA Graph capture failures seen in real PyTorch workloads and shows how to diagnose and fix them. It then presents an alternative to CUDA Graph Trees: Parameterized CUDA Graph launch, which automatically applies CUDA Graphs to only the compatible regions of a workload. All you need to do is make your workload compatible with torch.compile(). This enables CUDA Graph acceleration with minimal user effort and without increased memory usage.
Using this approach, llama3.1-70B in torchtitan runs with only a 2 GB memory increase over a non-graph baseline, compared to ~10 GB using traditional CUDA Graph techniques.
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
More on: LLM Engineering
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.
Dev.to AI
The Top 10 Highest-Paying Jobs Created by Artificial Intelligence in 2026 (Six-Figure Careers…
Medium · AI
Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.
Dev.to AI
Intel is bringing a chip to every computing category at Computex. The last time it could do that, it was the company everyone was trying to catch.
The Next Web AI
🎓
Tutor Explanation
DeepCamp AI