Parameterized CUDA Graph Launch in PyTorch: CUDA Graphs Without the Pain - Daniel Galvez, NVIDIA

Modern GPUs are fast enough that CPU kernel launch overhead has become a real bottleneck. CUDA Graphs can eliminate this overhead, but in practice they are hard to use and easy to get wrong. When CUDA Graph capture fails, PyTorch users typically face two choices: fix the code that breaks capture, often with limited guidance, or capture only parts of the workload. Partial capture comes with sharp footguns, most notably large increases in device memory usage due to CUDA Graphs' private memory pools.

This talk walks through the most common CUDA Graph capture failures seen in real PyTorch workloads and shows how to diagnose and fix them. It then presents an alternative to CUDA Graph Trees: Parameterized CUDA Graph launch, which automatically applies CUDA Graphs to only the compatible regions of a workload. All you need to do is make your workload compatible with torch.compile(). This enables CUDA Graph acceleration with minimal user effort and without the large increase in memory usage. Using this approach, llama3.1-70B in torchtitan runs with only a 2 GB memory increase over a non-graph baseline, compared to ~10 GB using traditional CUDA Graph techniques.
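For readers who have not captured a CUDA Graph by hand, the manual capture-and-replay pattern that the abstract describes as hard to use might look roughly like the sketch below. It relies on PyTorch's torch.cuda.CUDAGraph and torch.cuda.graph. The model, tensor shapes, and function name are illustrative assumptions and do not come from the talk. This is the traditional manual technique, not the Parameterized CUDA Graph launch that the talk presents.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def graph_replay_matches_eager():
    """Capture a tiny model in a CUDA Graph, replay it, compare with eager.

    Returns None when PyTorch or a CUDA device is unavailable.
    The Linear layer and the (8, 64) shape are hypothetical.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    model = torch.nn.Linear(64, 64).cuda().eval()
    # A graph replays fixed memory addresses, so inputs live in a static buffer.
    static_in = torch.randn(8, 64, device="cuda")

    # Warm up on a side stream before capture so lazy initialization
    # (for example cuBLAS handles) does not happen during capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    # Capture: kernels are recorded into the graph, and allocations
    # made here come from the graph's private memory pool.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_out = model(static_in)

    # Replay: copy new data into the static input, then launch the
    # whole recorded kernel sequence with a single CPU-side call.
    new_batch = torch.randn(8, 64, device="cuda")
    static_in.copy_(new_batch)
    graph.replay()

    with torch.no_grad():
        expected = model(new_batch)
    return torch.allclose(static_out, expected, atol=1e-4)


if __name__ == "__main__":
    print(graph_replay_matches_eager())
```

The static input and output buffers are where much of the difficulty comes from: passing a fresh tensor to the model after capture has no effect on the replayed graph, and the memory captured inside the graph stays reserved in its private pool.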
