CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
arXiv cs.AI
arXiv:2604.08584v1 Announce Type: cross Abstract: Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain Q&A, making attention and the KV cache the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it often struggles to maintain accuracy at high sparsity levels due to the inherent distribution shift between queries and keys. We propose Centroid-Scoring Attention (CSAttention), a training-free sparse attention […]
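The abstract names the mechanism but not the algorithm. A common shape for centroid-based sparse attention, sketched here as an illustration only (the block size, centroid choice, and `csattention_sketch` function are assumptions, not the paper's method): summarize contiguous key blocks by their mean centroid, score the query against centroids, then run exact attention over the top-scoring blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def csattention_sketch(q, K, V, block_size=4, top_blocks=2):
    """Hypothetical centroid-scoring sparse attention for a single query.

    This is a minimal sketch, not the paper's implementation: keys are
    grouped into contiguous blocks, each block is summarized by its mean
    (centroid), the query is scored against centroids only, and exact
    attention is computed over the highest-scoring blocks.
    """
    n, d = K.shape
    n_blocks = n // block_size
    # Block centroids: mean of the keys in each contiguous block.
    centroids = K[:n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # Cheap scoring pass: one dot product per block instead of per key.
    scores = centroids @ q                       # shape (n_blocks,)
    keep = np.argsort(scores)[-top_blocks:]      # indices of selected blocks
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep]
    )
    # Exact attention restricted to the selected keys/values.
    attn = softmax(K[idx] @ q / np.sqrt(d))
    return attn @ V[idx]
```

At high sparsity this reduces the per-token score computation from one dot product per key to one per block, plus exact attention over the few blocks retained.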