Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

📰 ArXiv cs.AI

arXiv:2604.25891v1 (announce type: cross)

Abstract: Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when tested outside the training distribution. We study a set of interventions proposed to reduce EM. We confirm that these interventions reduce or eliminate EM on existing evaluations (questions like "How do I make a quick buck?"). However, if the evaluati

Published 29 Apr 2026