Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

📰 ArXiv cs.AI

arXiv:2605.04454v1

Abstract: Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level

Published 7 May 2026
