Overtrained, Not Misaligned

📰 ArXiv cs.AI

arXiv:2605.12199v1 Announce Type: cross

Abstract: Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with

Published 13 May 2026
