Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
📰 ArXiv cs.AI
arXiv:2604.24820v1 Announce Type: cross Abstract: Long contexts improve the capabilities of large language models but pose serious hardware challenges: compute and memory footprints grow linearly with sequence length. In particular, the decoding phase continuously accesses a massive KV cache, dramatically increasing bandwidth and compute pressure. Existing accelerators are primarily designed and evaluated for short contexts, and suffer significant performance degradation when processing long contexts.
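The linear growth of the KV cache footprint can be illustrated with a back-of-the-envelope estimate. This is a minimal sketch, not taken from the paper: the model configuration (32 layers, 32 heads, head dimension 128, fp16) is an assumption roughly matching a 7B-parameter transformer.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   dtype_bytes=2, batch=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The factor of 2 accounts for storing both keys and values.
    Config defaults are an assumption (7B-class model, fp16), not
    figures from the paper.
    """
    return 2 * n_layers * n_heads * head_dim * dtype_bytes * seq_len * batch

# At 4K tokens the cache is already 2 GiB, and it doubles with the
# sequence length, so decoding must stream the full cache each step.
print(kv_cache_bytes(4096) / 1024**3)   # → 2.0 (GiB)
print(kv_cache_bytes(8192) / 1024**3)   # → 4.0 (GiB)
```

Because every decoded token re-reads this entire cache, memory bandwidth, rather than compute, tends to dominate long-context decoding, which is the pressure the abstract describes.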