Evaluating whether AI models would sabotage AI safety research

📰 ArXiv cs.AI

arXiv:2604.24618v1 Announce Type: new

Abstract: We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6): an unprompted sabotage evaluation, testing model behaviour when given opportunities to sabotage safety research, and a sabotage continuation evaluation, testing whether mod…

Published 28 Apr 2026