Less Is More: Fast and Accurate Reasoning with Cross-Head Unified Sparse Attention
📰 ArXiv cs.AI
arXiv:2508.07101v2 Announce Type: replace-cross

Abstract: Large reasoning models achieve strong performance through test-time scaling, but this incurs substantial computational overhead due to long decoding from short prompts. While sparse attention can reduce latency and memory usage, existing methods either degrade reasoning accuracy, because selection errors accumulate over long generation horizons, or require costly retraining. We introduce LessIsMore, a training-free sparse attention mechanism…
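
The abstract is cut off before the mechanism is described, but the title suggests the central move: instead of each attention head ranking KV-cache tokens independently, token importance is pooled across heads so that every head attends to one shared, unified token set. The sketch below illustrates that general idea only; the function names, tensor shapes, and the use of exact attention scores for selection are assumptions made for illustration, not the paper's actual algorithm.

```python
import torch

def unified_topk_indices(attn_scores: torch.Tensor, k: int) -> torch.Tensor:
    """Pick one shared set of k token positions for all heads.

    attn_scores: [num_heads, seq_len] attention weights from the
    current decoding step (hypothetical input shape).
    Returns: [k] indices of the tokens every head will attend to.
    """
    # Aggregate importance across heads instead of ranking per head,
    # so a mis-ranking by any single head is averaged out.
    pooled = attn_scores.sum(dim=0)   # [seq_len]
    return pooled.topk(k).indices     # shared top-k for all heads

def sparse_attend(q, k_cache, v_cache, budget: int):
    """One decoding step under a cross-head unified token budget.

    q:       [num_heads, head_dim]          query for the new token
    k_cache: [num_heads, seq_len, head_dim]
    v_cache: [num_heads, seq_len, head_dim]
    """
    # Full scores are computed once here for clarity; a real system
    # would estimate importance more cheaply.
    scores = torch.einsum("hd,hsd->hs", q, k_cache) / q.shape[-1] ** 0.5
    idx = unified_topk_indices(scores.softmax(dim=-1), budget)
    k_sel = k_cache[:, idx]           # [num_heads, budget, head_dim]
    v_sel = v_cache[:, idx]
    w = torch.einsum("hd,hbd->hb", q, k_sel).softmax(dim=-1)
    return torch.einsum("hb,hbd->hd", w, v_sel)
```

Pooling before the top-k is what distinguishes this from per-head sparse selection: a token kept by many heads survives even if one head mis-ranks it, which is one plausible way to limit the error accumulation over long generation horizons that the abstract mentions.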