Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

📰 ArXiv cs.AI

arXiv:2602.08585v2. Abstract: Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of to

Published 2 Jun 2026