IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

📰 ArXiv cs.AI

arXiv:2604.10539v1 Announce Type: cross

Abstract: The Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading the KV cache to the CPU while retaining only a subset on the GPU
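The linear scaling the abstract mentions follows directly from the cache layout: one key and one value vector per token, per head, per layer. A minimal sketch (the function name and the example model configuration are illustrative, not from the paper):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2):
    """Estimate KV-cache size: factor of 2 covers keys and values.

    The result grows linearly in seq_len, which is the bottleneck
    the abstract describes.
    """
    return 2 * num_layers * num_heads * head_dim * seq_len * batch * dtype_bytes

# A LLaMA-7B-like config (32 layers, 32 heads, head_dim 128, fp16):
# at 4096 tokens the cache alone is 2 GiB, and 8192 tokens double it.
print(kv_cache_bytes(32, 32, 128, 4096) / 1024**3)  # → 2.0
```

Doubling the sequence length doubles the footprint, which is why offloading part of the cache to CPU memory becomes attractive on memory-constrained GPUs.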

Published 14 Apr 2026