Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
📰 ArXiv cs.AI
arXiv:2604.09613v1 Announce Type: cross

Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short, while simultaneously triggering KV-cache failures -- OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch. We propose token-budget-aware pool routing: estimate each request's total token budget using a self-calibrating per-category
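The abstract is truncated, so the details of the estimator are not available here. As a rough illustration only, the routing idea it describes -- estimate a request's total token budget from a self-calibrating per-category statistic, then pick a pool -- might be sketched as follows. All names, the EMA calibration scheme, and the threshold are assumptions, not the paper's method:

```python
# Hypothetical sketch of token-budget-aware pool routing. The per-category
# "self-calibrating" estimator is assumed here to be an exponential moving
# average (EMA) of observed output lengths; the paper's actual scheme is
# not visible in the truncated abstract.
from collections import defaultdict


class BudgetRouter:
    """Route requests to a 'short' or 'long' pool by estimated token budget."""

    def __init__(self, threshold=2048, alpha=0.1, default_budget=512):
        self.threshold = threshold  # pool boundary in total tokens (assumed)
        self.alpha = alpha          # EMA smoothing factor (assumed)
        # Per-category estimate of output length, starting from a default.
        self.estimates = defaultdict(lambda: default_budget)

    def route(self, category, prompt_tokens):
        # Estimated total budget = known prompt tokens + predicted output tokens.
        budget = prompt_tokens + self.estimates[category]
        return "long" if budget > self.threshold else "short"

    def observe(self, category, output_tokens):
        # Self-calibration: fold the actual output length of a completed
        # request back into the per-category estimate.
        old = self.estimates[category]
        self.estimates[category] = (1 - self.alpha) * old + self.alpha * output_tokens


router = BudgetRouter()
print(router.route("chat", prompt_tokens=100))  # 100 + 512 <= 2048 -> "short"
router.observe("chat", output_tokens=4000)      # long completion shifts the estimate up
```

The short pool can then be configured with a small `max_model_len`, recovering the concurrency that worst-case provisioning wastes, while genuinely long requests go to instances sized for them.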