Runtime-Certified Bounded-Error Quantized Attention

📰 ArXiv cs.AI

arXiv:2605.20868v1 Announce Type: cross Abstract: KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in syst

Published 21 May 2026

Read full paper → ← Back to Reads