Architecture Determines Observability in Transformers
arXiv cs.AI
arXiv:2604.24801v1 Announce Type: cross

Abstract: Autoregressive transformers make confident errors, but activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. Whether that signal is preserved depends on the architecture and training recipe. We define observability as the linear readability of per-token decision quality from frozen mid-layer activations, after controlling for max-softmax confidence and activation norm. The correction is esse…
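The operational definition above is concrete enough to sketch as a probe. Below is a minimal sketch, assuming synthetic stand-ins for the activations and labels, of how one might measure the linear readability of per-token quality after partialling out confidence and norm. The residualization scheme and all names (`H`, `conf`, `y`, etc.) are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: linear probe for "observability" after controlling for
# max-softmax confidence and activation norm. Data here is synthetic;
# real inputs would come from a frozen model's mid-layer activations.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n, d = 5000, 256
H = rng.normal(size=(n, d))                   # mid-layer activations (stand-in)
conf = rng.uniform(size=n)                    # per-token max-softmax confidence
y = (rng.uniform(size=n) < conf).astype(int)  # toy per-token quality label

norm = np.linalg.norm(H, axis=1)              # activation-norm control
controls = np.column_stack([conf, norm])

H_tr, H_te, c_tr, c_te, y_tr, y_te = train_test_split(
    H, controls, y, test_size=0.3, random_state=0
)

# Control for confidence and norm by residualizing the activations
# against them, so the probe can only exploit signal the controls miss.
resid_model = LinearRegression().fit(c_tr, H_tr)
R_tr = H_tr - resid_model.predict(c_tr)
R_te = H_te - resid_model.predict(c_te)

baseline = LogisticRegression(max_iter=1000).fit(c_tr, y_tr)  # controls only
probe = LogisticRegression(max_iter=1000).fit(R_tr, y_tr)     # residualized probe

auc_base = roc_auc_score(y_te, baseline.predict_proba(c_te)[:, 1])
auc_probe = roc_auc_score(y_te, probe.predict_proba(R_te)[:, 1])
print(f"baseline (controls only) AUC:   {auc_base:.3f}")
print(f"residualized linear probe AUC:  {auc_probe:.3f}")
```

On this reading, a probe AUC above the controls-only baseline would indicate that the activations carry decision-quality signal beyond what output confidence exposes; on the synthetic data above the two should be close, since the toy labels are generated from confidence alone.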