ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
ArXiv cs.AI
arXiv:2604.09603v1 Announce Type: cross

Abstract: Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and
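For readers unfamiliar with the baseline the abstract builds on, here is a minimal sketch of the generic draft-then-verify loop that speculative decoding uses; it is not ECHO's method, and the `draft_model`/`target_model` functions are toy stand-ins for real LLMs, assumed only for illustration. A cheap draft model proposes k candidate tokens, the target model verifies them, and the longest matching prefix is accepted:

```python
def draft_model(prefix, k):
    # Hypothetical cheap drafter: proposes the next k tokens deterministically.
    return [(prefix[-1] + i) % 7 for i in range(1, k + 1)]

def target_model(prefix):
    # Hypothetical target model: the "true" next token given the context.
    return (prefix[-1] + 1) % 7

def speculative_step(prefix, k=4):
    """Propose k draft tokens, verify against the target, accept the matching prefix.

    In a real system the k verifications are batched into one target-model
    forward pass, which is exactly the verification compute that dominates
    under high concurrency.
    """
    draft = draft_model(prefix, k)
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        true_tok = target_model(ctx)
        if tok == true_tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: fall back to the target model's token and stop.
            accepted.append(true_tok)
            ctx.append(true_tok)
            break
    return accepted

print(speculative_step([0], k=4))  # → [1, 2, 3, 4]
```

Tree-based variants generalize this loop by verifying a tree of candidate continuations per step; the abstract's dilemma concerns how that tree is chosen (static vs. dynamic).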