Quantization Dominates Rank Reduction for KV-Cache Compression

📰 ArXiv cs.AI

arXiv:2604.11501v1 (cross-list)

Abstract: We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B parameters, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL, depending on the model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid b…
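
The abstract's core comparison is easy to illustrate in miniature. Below is a minimal sketch, not the paper's protocol: it contrasts the two strategies on a synthetic key matrix at a matched per-token bit budget. The helper names (`rank_reduce`, `quantize`), the fp16 accounting for the rank factors, and the per-channel uniform quantizer are all illustrative assumptions, and small overheads (basis matrix, scales) are ignored.

```python
import numpy as np

def rank_reduce(X, r):
    """Rank reduction: keep the top-r principal directions of X (T x d)
    via SVD. Stores T*r coefficients plus a d*r basis at full precision."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]           # rank-r reconstruction

def quantize(X, bits):
    """Quantization: keep every dimension, store each value with `bits`
    bits using a uniform per-channel grid (one min/scale per channel)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    levels = 2 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((X - lo) / scale)               # integer codes in [0, levels]
    return q * scale + lo                        # dequantized values

rng = np.random.default_rng(0)
T, d = 512, 128                                  # tokens x head dimension (assumed)
X = rng.standard_normal((T, d))

# Matched per-token budget, ignoring small overheads:
# rank r at fp16 costs 16*r bits/token; b-bit quantization costs b*d bits/token.
bits = 4
r = bits * d // 16                               # here: r = 32

err_rank = np.linalg.norm(X - rank_reduce(X, r)) / np.linalg.norm(X)
err_quant = np.linalg.norm(X - quantize(X, bits)) / np.linalg.norm(X)
print(f"rank-{r} relative error:  {err_rank:.3f}")
print(f"int{bits} relative error: {err_quant:.3f}")
```

On a random Gaussian matrix the spectrum is nearly flat, so rank reduction discards a lot of energy while 4-bit quantization loses little; real KV caches have more structure, but the abstract reports the same ordering holding across models and compression levels (as measured in PPL rather than reconstruction error).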

Published 14 Apr 2026