Beyond the Theory: What Actually Breaks When You Scale Your Disaggregated Pytorch Models - Ekin Karabulut & Ron Kahn, NVIDIA
As inference demand explodes, new techniques for optimizing inference deployments have emerged. One such technique is disaggregated inference, which splits inference into differently optimized workloads (e.g. prefill and decode) running on separate workers. The theory is straightforward: better GPU utilization, better inference performance, and tighter control over SLAs. Deploying it in production is not.
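To make the split concrete, here is a toy Python sketch (purely illustrative, not taken from the talk or from any real vLLM/SGLang API): a compute-heavy prefill step builds a stand-in KV cache on one worker, and a separate decode step reuses that transferred cache to emit tokens.

```python
# Toy illustration of disaggregated inference; the functions and "KV cache"
# below are stand-ins, not real vLLM or SGLang APIs.

def prefill(prompt: str) -> dict:
    """Compute-heavy pass over the full prompt; returns a stand-in KV cache."""
    return {"prompt": prompt, "kv": [ord(c) for c in prompt]}

def decode(kv_cache: dict, max_new_tokens: int) -> str:
    """Step-by-step generation that only needs the transferred KV cache."""
    seed = sum(kv_cache["kv"])
    return "".join(chr(97 + (seed + i) % 26) for i in range(max_new_tokens))

if __name__ == "__main__":
    cache = prefill("Hello world")   # runs on a prefill worker
    print(decode(cache, 8))          # cache is shipped to a decode worker
```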
Scaling happens at multiple connected levels. Adding prefill workers for a traffic spike? Those workers belong to a prefill leader and must scale as a unit. But your prefill-to-decode ratio matters too: scale prefill without matching decode capacity and you've just moved the bottleneck. Placement also plays a role: place prefill and decode far apart in your network topology and KV-cache transfers will kill your latency. Standard autoscaling treats these as independent components. They're not.
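As a back-of-the-envelope illustration of that coupling, the sketch below (our own simplification, with made-up throughput numbers and a hypothetical group_size parameter) computes how many prefill and decode replicas a target request rate needs when prefill workers scale in whole leader/worker groups.

```python
import math

# Simplified sketch of ratio-aware scaling; throughputs and group sizes
# are illustrative assumptions, not measurements from the talk.

def required_replicas(request_rate: float,
                      prefill_throughput: float,
                      decode_throughput: float,
                      group_size: int = 2) -> tuple[int, int]:
    """Return (prefill_replicas, decode_replicas) for a target request rate.

    Throughputs are requests/second a single worker sustains for its phase.
    Prefill workers that share a leader scale in whole groups of `group_size`.
    """
    prefill = math.ceil(request_rate / prefill_throughput)
    prefill = math.ceil(prefill / group_size) * group_size  # round up to whole groups

    # Decode capacity must track prefill output, or the bottleneck just moves.
    decode = math.ceil(request_rate / decode_throughput)
    return prefill, decode

if __name__ == "__main__":
    # Traffic spike of 120 req/s; each prefill worker handles 20 req/s,
    # each decode worker 10 req/s.
    print(required_replicas(120, prefill_throughput=20, decode_throughput=10))
    # (6, 12): adding 6 prefill workers without 12 decode workers would
    # simply make decode the new bottleneck.
```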
In this talk, we'll share what we've learned running disaggregated vLLM and SGLang deployments on K8s: what broke, what worked, and how we're improving performance. We'll evaluate approaches ranging from standard deployments to specialized APIs like LWS and Grove, and discuss how these integrate with frameworks like llm-d and Dynamo.