Profiling PyTorch training without accidentally stalling the GPU [D]

📰 Reddit r/MachineLearning

Profiling PyTorch training has an interesting measurement problem: the more you measure, the more you can change the behavior of the run itself. A simple example is torch.cuda.synchronize(). It gives cleaner timing boundaries, but it also inserts synchronization points into an otherwise asynchronous CUDA workload. An alternative is to use CUDA events around selected boundaries and read them later, so timing can be captured without forc
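A minimal sketch of the event-based idea, assuming a single CUDA stream. The helper name time_steps_with_events and the step_fn callback are hypothetical stand-ins for one training step, not something from the post. Events are recorded on the stream as the steps are enqueued, and the host waits only once, after the loop, before reading the timings.

```python
import torch


def time_steps_with_events(step_fn, num_steps):
    """Return per-step GPU time in milliseconds (hypothetical helper).

    step_fn is an assumed callback that enqueues one training step
    on the current CUDA stream.
    """
    if num_steps <= 0:
        return []
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA events require a CUDA device")

    starts = [torch.cuda.Event(enable_timing=True) for _ in range(num_steps)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(num_steps)]

    for i in range(num_steps):
        starts[i].record()  # enqueued on the current stream; the host does not wait
        step_fn()
        ends[i].record()

    # Single host-side wait, after all steps have been enqueued.
    # Events on one stream complete in order, so the last one covers the rest.
    ends[-1].synchronize()

    return [start.elapsed_time(end) for start, end in zip(starts, ends)]


if __name__ == "__main__" and torch.cuda.is_available():
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    print(time_steps_with_events(lambda: a @ b, num_steps=5))
```

The timings are read only after the loop, so no step waits on the host for the sake of measurement. The cost is that results arrive late, and the single wait at the end is still a synchronization point, just one placed outside the measured region.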

Published 27 May 2026
