SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
📰 ArXiv cs.AI
arXiv:2604.12247v1 Announce Type: cross Abstract: Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers.
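The abstract's core idea, drafting tokens cheaply from the model itself and bounding the draft when confidence drops, can be illustrated with a minimal sketch. This is not the paper's SpecBound algorithm (the abstract does not give its details); it is a generic self-speculation loop under stated assumptions: `shallow_logits_fn` stands in for a cheap shallow-layer predictor, `base_logits_fn` for the full model, and all function names and thresholds are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def draft_tokens(shallow_logits_fn, prefix, max_draft=4, conf_threshold=0.8):
    """Greedily draft tokens from a cheap (shallow-layer) predictor,
    stopping early when its confidence drops below a threshold.
    This bounds wasted work on 'difficult' low-confidence tokens."""
    draft = []
    ctx = list(prefix)
    for _ in range(max_draft):
        probs = softmax(shallow_logits_fn(ctx))
        tok = int(probs.argmax())
        if probs[tok] < conf_threshold:
            break  # low-confidence token: stop drafting here
        draft.append(tok)
        ctx.append(tok)
    return draft

def verify_with_base(base_logits_fn, prefix, draft):
    """Standard speculative verification: accept the longest prefix of
    the draft that matches the base model's greedy choice."""
    ctx = list(prefix)
    accepted = []
    for tok in draft:
        base_tok = int(softmax(base_logits_fn(ctx)).argmax())
        if base_tok != tok:
            break  # first mismatch ends the accepted run
        accepted.append(tok)
        ctx.append(tok)
    return accepted
```

In a real system the verification pass scores all draft positions in one batched forward pass, which is where the speedup comes from; the confidence bound keeps the drafter from burning that budget on tokens the shallow layers are likely to get wrong.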