The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub & Kellen Swain

PyTorch · Beginner ·🧠 Large Language Models ·3mo ago

Skills: LLM Engineering 90% · Fine-tuning LLMs 60%

Key Takeaways

Implements preemptive scheduling via chunked decoding for production LLM serving

Original Description

The Token Slice: Implementing Preemptive Scheduling Via Chunked Decoding - Maroon Ayoub, IBM & Kellen Swain, Google

Production LLM serving faces a critical trade-off: while continuous batching maximizes throughput, it often sacrifices SLAs due to Head-of-Line (HoL) blocking. When long-context requests hijack the engine, tail latencies spike. Without fine-grained preemption, guaranteeing priority or fairness remains nearly impossible.

We propose a solution: Chunked Decoding. By treating a fixed number of tokens as a "time slice," we bring 50 years of OS scheduling wisdom to inference. This technique decouples generation from completion, enabling a preemptive multitasking environment for LLMs.

In this talk, we present a sidecar implementation for PyTorch-based servers (like vLLM) that orchestrates decoding in manageable chunks. This allows the system to pause, hold, or swap requests mid-stream without discarding the KV cache. We will share early evaluation results, discussing how varying chunk sizes impact priority handling and tail latency.

Attendees will learn how a sidecar approach enables sophisticated scheduling while keeping the core engine lean, offering a blueprint for integrating preemptive scheduling into the next generation of model servers.
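To make the "time slice" idea concrete, here is a minimal toy sketch in Python. It is not the speakers' sidecar and does not call vLLM or PyTorch. The names `Request`, `decode_chunk`, and `run` are hypothetical, as is the convention that a lower `priority` value means more urgent. The `generated` list stands in for the state (such as the KV cache) that a paused request would keep. Time is counted in decoded tokens.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Request:
    name: str
    priority: int        # assumption: lower value = more urgent
    total_tokens: int    # number of tokens this request wants to generate
    # Stand-in for retained decode state (e.g. a KV cache) kept across pauses.
    generated: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.generated) >= self.total_tokens


def decode_chunk(req: Request, chunk_size: int) -> int:
    """Toy stand-in for an engine call: emit at most chunk_size tokens."""
    n = min(chunk_size, req.total_tokens - len(req.generated))
    start = len(req.generated)
    req.generated.extend(f"{req.name}-t{start + i}" for i in range(n))
    return n


def run(arrivals, chunk_size):
    """Schedule one chunk at a time, always picking the most urgent request.

    arrivals: list of (arrival_step, Request), with steps counted in tokens.
    Returns the sequence of (request name, tokens decoded) slices.
    """
    pending = sorted(arrivals, key=lambda a: a[0])
    ready, tie, trace = [], count(), []
    step, i = 0, 0
    while i < len(pending) or ready:
        # Admit every request that has arrived by the current step.
        while i < len(pending) and pending[i][0] <= step:
            req = pending[i][1]
            heapq.heappush(ready, (req.priority, next(tie), req))
            i += 1
        if not ready:
            step = pending[i][0]  # idle until the next arrival
            continue
        _, _, req = heapq.heappop(ready)
        n = decode_chunk(req, chunk_size)
        trace.append((req.name, n))
        step += n
        if not req.done:
            # Preemption point: the request goes back to the queue with its
            # state intact and resumes from where it stopped.
            heapq.heappush(ready, (req.priority, next(tie), req))
    return trace
```

In this toy model, a 10-token low-priority request arriving at step 0 and a 3-token urgent request arriving at step 2 produce the slices `long 4, urgent 3, long 4, long 2` with `chunk_size=4`, so the urgent request starts at step 4. With `chunk_size=10` the urgent request waits until step 10, which is the head-of-line blocking the abstract describes. Smaller chunks give the scheduler more preemption points; what they cost in a real engine is outside what this sketch models.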


