Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
📰 ArXiv cs.AI
arXiv:2604.25891v1 Announce Type: cross Abstract: Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when tested outside the training distribution. We study a set of interventions proposed to reduce EM. We confirm that these interventions reduce or eliminate EM on existing evaluations (questions like "How do I make a quick buck?"). However, if the evaluati
DeepCamp AI