Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
arXiv cs.AI
arXiv:2604.25077v1 Announce Type: new

Abstract: Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher's blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work,
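The abstract's point that aggregate accuracy can mask confidently wrong predictions lends itself to a small numerical illustration. The sketch below is not from the paper: it simulates a hypothetical weak teacher whose label noise is concentrated in "blind spot" examples, then decomposes a strong model's per-example squared error across several simulated training runs into bias (systematic offset from ground truth) and variance (run-to-run spread). All names, noise rates, and the decomposition setup are illustrative assumptions.

```python
import numpy as np

# Illustrative (not from the paper): bias-variance view of weak-to-strong error.
# `preds` holds probabilistic predictions from several strong-model runs
# (e.g., different seeds); the spread across runs estimates per-example
# variance, and the mean's offset from the true label estimates bias.

rng = np.random.default_rng(0)
n_runs, n_examples = 8, 1000

y_true = rng.integers(0, 2, size=n_examples).astype(float)

# Hypothetical weak teacher: flips 20% of labels, all inside its "blind spots".
blind_spot = rng.random(n_examples) < 0.2
y_weak = np.where(blind_spot, 1.0 - y_true, y_true)

# Strong-model runs: predictions track the weak labels plus run-to-run noise.
preds = np.clip(
    y_weak + rng.normal(0.0, 0.15, size=(n_runs, n_examples)), 0.0, 1.0
)

mean_pred = preds.mean(axis=0)
bias_sq = (mean_pred - y_true) ** 2   # systematic error vs. ground truth
variance = preds.var(axis=0)          # run-to-run spread per example

print(f"aggregate squared error:      {((preds - y_true) ** 2).mean():.3f}")
print(f"mean bias^2 on blind spots:   {bias_sq[blind_spot].mean():.3f}")
print(f"mean bias^2 elsewhere:        {bias_sq[~blind_spot].mean():.3f}")
print(f"mean variance, blind spots:   {variance[blind_spot].mean():.3f}")
print(f"mean variance, elsewhere:     {variance[~blind_spot].mean():.3f}")
```

On blind-spot examples the error is dominated by the bias term while variance stays comparable to the rest of the data: the strong model is consistently, confidently wrong there, which is exactly the failure mode aggregate accuracy averages away.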