ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

📰 ArXiv cs.AI

arXiv:2604.10898v1 Announce Type: new Abstract: Large language models (LLMs) have shown strong performance on complex reasoning tasks but often need to generate long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache f
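The memory growth the abstract describes is easy to quantify: for a standard transformer, the KV cache stores one key and one value vector per token, per layer, per KV head. The sketch below is an illustrative back-of-the-envelope estimate, not the paper's method; the model configuration (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical 7B-scale setup chosen for the example.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size in bytes for a vanilla transformer.

    The factor of 2 accounts for storing both keys and values
    at every layer; bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-scale configuration: growth is linear in output length.
for seq_len in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB")
```

Under these assumptions, a 100k-token reasoning trace needs roughly 100x the cache of a 1k-token one, which is why output-side KV compression matters for long chains of thought.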

Published 14 Apr 2026