Model Spec Midtraining: Improving How Alignment Training Generalizes

📰 ArXiv cs.AI

arXiv:2605.02087v1 (announce type: new)

Abstract: Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning (training on demonstrations of spec-aligned behavior) can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fin

Published 5 May 2026
