Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

📰 ArXiv cs.AI

arXiv:2604.09613v1 Announce Type: cross Abstract: Production vLLM fleets provision every instance for the worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short, while simultaneously triggering KV-cache failures: OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch. We propose token-budget-aware pool routing: estimate each request's total token budget using a self-calibrating per-category
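The abstract is truncated, but the stated mechanism (estimate each request's total token budget with a self-calibrating per-category estimator, then route to an appropriately configured pool) can be sketched. This is a minimal illustration, not the paper's implementation: the class name, the EMA-based calibration, the pool names, and the 2048-token threshold are all assumptions.

```python
from collections import defaultdict

class BudgetRouter:
    """Hypothetical sketch of token-budget-aware pool routing.

    Per-category budget estimates self-calibrate via an exponential
    moving average (EMA) of observed total token counts. All names
    and thresholds here are illustrative, not from the paper.
    """

    def __init__(self, short_budget=2048, alpha=0.2, default_estimate=512):
        self.short_budget = short_budget        # max estimated tokens for the short pool
        self.alpha = alpha                      # EMA smoothing factor
        self.estimates = defaultdict(lambda: default_estimate)

    def route(self, category):
        # Pick a pool based on the current per-category budget estimate.
        if self.estimates[category] <= self.short_budget:
            return "short-pool"
        return "long-pool"

    def observe(self, category, actual_total_tokens):
        # Self-calibrate: pull the estimate toward the observed total.
        est = self.estimates[category]
        self.estimates[category] = (1 - self.alpha) * est + self.alpha * actual_total_tokens

router = BudgetRouter()
print(router.route("chat"))          # default estimate (512) -> short-pool
for _ in range(20):
    router.observe("codegen", 8000)  # repeatedly observe long completions
print(router.route("codegen"))       # estimate has converged high -> long-pool
```

The point of the sketch is the feedback loop: routing decisions use only the estimate, and observed completions continually correct it, so a misconfigured prior recovers on its own without per-request profiling.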

Published 14 Apr 2026