ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
ArXiv cs.AI
arXiv:2604.09603v1 Announce Type: cross

Abstract: Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and
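For readers unfamiliar with the baseline the abstract builds on, here is a minimal sketch of the generic draft-then-verify loop that speculative decoding uses; it is not ECHO's method, and the `draft_model`/`target_model` functions are toy stand-ins for real LLMs, assumed only for illustration. A cheap draft model proposes k candidate tokens, the target model verifies them, and the longest matching prefix is accepted:

```python
def draft_model(prefix, k):
    # Hypothetical cheap drafter: proposes the next k tokens deterministically.
    return [(prefix[-1] + i) % 7 for i in range(1, k + 1)]

def target_model(prefix):
    # Hypothetical target model: the "true" next token given the context.
    return (prefix[-1] + 1) % 7

def speculative_step(prefix, k=4):
    """Propose k draft tokens, verify against the target, accept the matching prefix.

    In a real system the k verifications are batched into one target-model
    forward pass, which is exactly the verification compute that dominates
    under high concurrency.
    """
    draft = draft_model(prefix, k)
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        true_tok = target_model(ctx)
        if tok == true_tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: fall back to the target model's token and stop.
            accepted.append(true_tok)
            ctx.append(true_tok)
            break
    return accepted

print(speculative_step([0], k=4))  # → [1, 2, 3, 4]
```

Tree-based variants generalize this loop by verifying a tree of candidate continuations per step; the abstract's dilemma concerns how that tree is chosen (static vs. dynamic).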