Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

📰 ArXiv cs.AI

arXiv:2604.24542v1 (announce type: cross)

Abstract: Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-

Published 28 Apr 2026