The Token Slice: Implementing Preemptive Scheduling Via Chunked Decod... Maroon Ayoub & Kellen Swain
The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub, IBM & Kellen Swain, Google
Production LLM serving faces a critical trade-off: while continuous batching maximizes throughput, it often sacrifices SLAs due to Head-of-Line (HoL) blocking. When long-context requests hijack the engine, tail latencies spike. Without fine-grained preemption, guaranteeing priority or fairness remains nearly impossible.
We propose a solution: Chunked Decoding. By treating a fixed number of tokens as a "time slice," we bring 50 years of OS scheduling wisdom to inference. This technique decouples generation from completion, enabling a preemptive multitasking environment for LLMs.
In this talk, we present a sidecar implementation for PyTorch-based servers (like vLLM) that orchestrates decoding in manageable chunks. This allows the system to pause, hold, or swap requests mid-stream without discarding the KV cache. We will share early evaluation results, discussing how varying chunk sizes impact priority handling and tail latency. Attendees will learn how a sidecar approach enables sophisticated scheduling while keeping the core engine lean—offering a blueprint for integrating preemptive scheduling into the next generation of model servers.
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
More on: LLM Engineering
View skill →Related AI Lessons
🎓
Tutor Explanation
DeepCamp AI