The Token Slice: Implementing Preemptive Scheduling Via Chunked Decod... Maroon Ayoub & Kellen Swain

PyTorch · Beginner · 🧠 Large Language Models · 3w ago
The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub, IBM & Kellen Swain, Google

Production LLM serving faces a critical trade-off: while continuous batching maximizes throughput, it often sacrifices SLAs due to Head-of-Line (HoL) blocking. When long-context requests hijack the engine, tail latencies spike. Without fine-grained preemption, guaranteeing priority or fairness remains nearly impossible.

We propose a solution: Chunked Decoding. By treating a fixed number of tokens as a "time slice," we bring 50 years of OS scheduling wisdom to inference. This technique decouples generation from completion, enabling a preemptive multitasking environment for LLMs.

In this talk, we present a sidecar implementation for PyTorch-based servers (like vLLM) that orchestrates decoding in manageable chunks. This allows the system to pause, hold, or swap requests mid-stream without discarding the KV cache. We will share early evaluation results, discussing how varying chunk sizes impact priority handling and tail latency. Attendees will learn how a sidecar approach enables sophisticated scheduling while keeping the core engine lean, offering a blueprint for integrating preemptive scheduling into the next generation of model servers.
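To make the "token slice" idea concrete, here is a minimal, engine-free sketch of how a scheduler might interleave requests at chunk boundaries. It is not the speakers' sidecar and does not call vLLM or PyTorch: `Request`, `decode_chunk`, `simulate`, and `make_workload` are hypothetical names, the clock simply counts decoded tokens, a lower `priority` value is assumed to mean more urgent, and the `generated` counter stands in for per-request state (such as a retained KV cache) that survives a pause.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; `generated` stands in for retained state."""
    name: str
    priority: int        # assumption: lower value = more urgent
    arrival: int         # arrival time, measured in decoded tokens
    max_new_tokens: int
    generated: int = 0

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens


def decode_chunk(req: Request, chunk_size: int) -> int:
    """Stand-in for an engine call that decodes at most `chunk_size` tokens."""
    n = min(chunk_size, req.max_new_tokens - req.generated)
    req.generated += n
    return n


def simulate(requests, chunk_size):
    """Run one request at a time, re-deciding at every chunk boundary.

    Returns a dict mapping request name to its finish time (in tokens).
    """
    pending = sorted(requests, key=lambda r: r.arrival)
    ready = []                    # min-heap of (priority, seq, request)
    seq = itertools.count()       # tie-breaker: equal priorities stay FIFO
    clock, i, finish = 0, 0, {}
    while i < len(pending) or ready:
        # Admit every request that has arrived by the current slice boundary.
        while i < len(pending) and pending[i].arrival <= clock:
            heapq.heappush(ready, (pending[i].priority, next(seq), pending[i]))
            i += 1
        if not ready:
            clock = pending[i].arrival   # idle until the next arrival
            continue
        _, _, req = heapq.heappop(ready)
        clock += decode_chunk(req, chunk_size)
        if req.done:
            finish[req.name] = clock
        else:
            # Preemption point: the request is paused, not restarted.
            heapq.heappush(ready, (req.priority, next(seq), req))
    return finish


def make_workload():
    """A long low-priority request, then a short urgent one arriving later."""
    return [
        Request("long", priority=1, arrival=0, max_new_tokens=100),
        Request("short", priority=0, arrival=5, max_new_tokens=10),
    ]


if __name__ == "__main__":
    print(simulate(make_workload(), chunk_size=100))
    print(simulate(make_workload(), chunk_size=10))
```

In this toy workload, a chunk size of 100 lets the long request run to completion first, so the urgent request finishes at token 110. With a chunk size of 10, the urgent request is picked up at the first slice boundary and finishes at token 20, while the long request finishes at 110 instead of 100. Real systems would also pay a per-switch cost that this sketch ignores, which is one reason the choice of chunk size matters.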