Parameterized CUDA Graph Launch in PyTorch: CUDA Graphs Without the Pain - Daniel Galvez, NVIDIA

Name: Parameterized CUDA Graph Launch in PyTorch: CUDA Graphs Without the Pain - Daniel Galvez, NVIDIA
Uploaded: 2026-04-20T20:22:20Z
Channel: PyTorch


Modern GPUs are fast enough that CPU kernel launch overhead has become a real bottleneck. CUDA Graphs can eliminate this overhead, but in practice they are hard to use and easy to get wrong. When CUDA Graph capture fails, PyTorch users typically face two choices: fix the code that breaks capture, often with limited guidance, or capture only parts of the workload. Partial capture comes with sharp footguns, most notably large increases in device memory usage due to CUDA Graphs’ private memory pools.

This talk walks through the most common CUDA Graph capture failures seen in real PyTorch workloads and shows how to diagnose and fix them. It then presents an alternative to CUDA Graph Trees: Parameterized CUDA Graph launch, which automatically applies CUDA Graphs to only the compatible regions of a workload. All you need to do is make your workload compatible with torch.compile(). This enables CUDA Graph acceleration with minimal user effort and without a large increase in memory usage.

Using this approach, llama3.1-70B in torchtitan runs with only a 2 GB memory increase over a non-graph baseline, compared to ~10 GB using traditional CUDA Graph techniques.
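To make the launch-overhead argument concrete, here is a toy cost model in plain Python. It is only a sketch: the function names and every timing value are hypothetical, chosen for illustration, and are not measurements from the talk. It assumes kernel launches are asynchronous, so in eager mode the slower of the CPU launch and the GPU execution sets the pace for each kernel, while a graph replay pays a single launch cost for the whole sequence.

```python
# Toy cost model of CPU launch overhead. All timings are hypothetical,
# chosen only for illustration; they are not measurements.


def eager_time_us(num_kernels, launch_us, kernel_us):
    """Estimate time when the CPU launches every kernel separately.

    Launches are asynchronous, so in steady state each kernel costs
    whichever is slower: the CPU launch or the GPU execution.
    """
    return num_kernels * max(launch_us, kernel_us)


def graph_time_us(num_kernels, graph_launch_us, kernel_us):
    """Estimate time when one CPU launch replays a graph of all kernels."""
    return graph_launch_us + num_kernels * kernel_us


# Hypothetical workload: 1000 short kernels that each run faster on the
# GPU (2 us) than the CPU can launch them (5 us).
eager = eager_time_us(1000, launch_us=5.0, kernel_us=2.0)
graphed = graph_time_us(1000, graph_launch_us=10.0, kernel_us=2.0)

print(f"eager: {eager:.0f} us, graphed: {graphed:.0f} us")
```

Under these assumed numbers the eager run is bound by the CPU, and the graphed run is bound by the GPU. When kernels take longer than their launch cost, the two estimates converge, which is why the benefit is largest for workloads made of many short kernels.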
