How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models

📰 ArXiv cs.AI

arXiv:2606.03002v1 (cross-listed)

Abstract: Quantization is a standard path to deploying large language models, and a quantized model is typically judged acceptable when its perplexity or downstream accuracy stays close to the full-precision original. Whether the model still computes in the same way, or whether the interpretable features identified in the full-precision model survive weight rounding, is rarely tested, even as safety audits and steering interventions increasingly rely on th

Published 3 Jun 2026