Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
📰 ArXiv cs.AI
arXiv:2605.04454v1 (announce type: new)
Abstract: Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level