Lightning Talk: Why Logging Isn’t Enough: Making PyTorch Training Regressions Visible in Practice - Sahana Venkatesh, Wayve
PyTorch teams often log rich training metrics, yet still discover training regressions late, after significant developer time and GPU budget have already been spent. In this talk, I’ll share a practical pattern we used to turn PyTorch training metrics into an operational guardrail for large-model training.
The approach combines scheduled short and long training runs, standardized performance and stability metrics (throughput, memory, loss, divergence), and simple statistical baselines to automatically surface regressions via alerts without hard gates or complex infrastructure.
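To make the pattern concrete, here is a minimal sketch of the kind of statistical baseline check described above; the function name, metric values, and 3-sigma threshold are illustrative assumptions, not the speaker’s actual implementation:

```python
import statistics

def detect_regression(history, current, z_threshold=3.0):
    """Flag a metric value that deviates from its historical baseline.

    history: per-run values of one standardized metric (e.g. throughput
             in samples/sec, peak memory, final loss) from scheduled runs.
    current: the same metric from the latest run.
    Returns True if `current` lies more than `z_threshold` standard
    deviations from the historical mean (a simple z-score test).
    """
    if len(history) < 2:
        return False  # not enough runs yet to form a baseline
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold

# Example: throughput (samples/sec) from recent scheduled baseline runs.
throughput_history = [412.0, 409.5, 415.2, 411.8, 413.1]
if detect_regression(throughput_history, 362.4):
    print("ALERT: throughput regressed vs. baseline")  # alert only, no hard gate
```

A check like this runs after each scheduled job and only raises an alert, which matches the “no hard gates” constraint: a flagged run never blocks anyone, it just makes the regression visible early.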
I’ll focus on why logging alone is insufficient, how we chose what to monitor, and what tradeoffs we encountered (false positives, alert fatigue, baseline drift). The goal is not a tool demo, but a reusable pattern other PyTorch teams can adapt to catch training regressions earlier and make retraining more predictable.
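One common way to handle the baseline-drift tradeoff mentioned above is to compute the baseline over a rolling window of recent runs, so it follows intentional changes to the model or data instead of going stale. Again a hypothetical sketch, with the same caveats as before:

```python
import statistics
from collections import deque

class RollingBaseline:
    """Keep a fixed-size window of recent metric values, so the baseline
    tracks intentional training changes instead of drifting stale."""

    def __init__(self, window=10, z_threshold=3.0):
        self.values = deque(maxlen=window)
        self.z_threshold = z_threshold

    def update_and_check(self, value):
        """Return True if `value` deviates from the windowed baseline."""
        flagged = False
        if len(self.values) >= 2:
            mean = statistics.mean(self.values)
            stdev = statistics.stdev(self.values)
            if stdev > 0:
                flagged = abs(value - mean) / stdev > self.z_threshold
        # Fold only non-anomalous runs into the window, so one bad run
        # does not inflate the variance and mask later regressions.
        if not flagged:
            self.values.append(value)
        return flagged

# Example: feed in the final loss of each scheduled run.
baseline = RollingBaseline(window=10)
for loss in [2.31, 2.29, 2.33, 2.30, 2.28, 2.75]:
    if baseline.update_and_check(loss):
        print(f"ALERT: final loss {loss} diverges from rolling baseline")
```

Excluding flagged runs from the window is itself a tradeoff: it keeps a real regression from silently widening the baseline, but after a legitimate change (say, a new dataset) the window has to be reset, which connects to the false-positive and alert-fatigue tradeoffs the talk covers.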
Watch on YouTube ↗