StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving

📰 ArXiv cs.AI

arXiv:2604.09562v1 Announce Type: cross Abstract: Efficient LLM serving must balance throughput and latency across diverse, bursty workloads. We introduce StreamServe, a disaggregated prefill-decode serving architecture that combines metric-aware routing across compute lanes with adaptive speculative decoding that tunes speculation depth online from runtime signals. StreamServe comprises four components: StreamScheduler for request orchestration, FlowGuard for multi-signal routing, PipeServe Eng…
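The abstract's "adaptive speculative decoding that tunes speculation depth online from runtime signals" can be sketched as a small feedback controller. The mechanism below is a generic illustration, not the paper's algorithm: it grows the draft depth when recent draft tokens are mostly accepted by the verifier and shrinks it when they are mostly rejected. All class names, thresholds, and bounds are placeholder assumptions.

```python
# Illustrative sketch of online speculation-depth tuning (not StreamServe's
# actual controller). Depth rises under high acceptance, falls under low
# acceptance; thresholds and bounds are invented for the example.

class SpeculationDepthController:
    def __init__(self, depth=4, min_depth=1, max_depth=8,
                 raise_at=0.8, lower_at=0.4):
        self.depth = depth
        self.min_depth = min_depth
        self.max_depth = max_depth
        self.raise_at = raise_at   # acceptance rate above which depth grows
        self.lower_at = lower_at   # acceptance rate below which depth shrinks

    def update(self, accepted, proposed):
        """Adjust depth from the last verification step's acceptance rate."""
        if proposed == 0:
            return self.depth
        rate = accepted / proposed
        if rate >= self.raise_at:
            self.depth = min(self.depth + 1, self.max_depth)
        elif rate <= self.lower_at:
            self.depth = max(self.depth - 1, self.min_depth)
        return self.depth


ctrl = SpeculationDepthController()
print(ctrl.update(accepted=4, proposed=4))  # all accepted -> depth 5
print(ctrl.update(accepted=1, proposed=5))  # mostly rejected -> depth 4
```

In a real disaggregated serving stack, a signal like this would feed the decode-side draft loop each step; the paper presumably uses richer runtime signals than acceptance rate alone.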

Published 14 Apr 2026