Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

arXiv cs.AI

arXiv:2603.18280v2 (replace-cross)

Abstract: Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: the routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings fol…

Published 14 Apr 2026