Calibration-Aware Policy Optimization for Reasoning LLMs
📰 ArXiv cs.AI
arXiv:2604.12632v1 Announce Type: cross
Abstract: Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as measured by the Area Under the Curve (AUC). Existing approaches either yield limited calibration improvements or sacrifice reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-ag
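The snippet does not define the calibration AUC precisely; a minimal sketch under one plausible reading is the AUROC of using low perplexity as a confidence ranking for correctness, where values below 0.5 correspond to the inverted (overconfident-when-wrong) ranking the abstract describes. All function names here are hypothetical, not from the paper.

```python
import numpy as np

def sequence_perplexity(token_logprobs):
    """Perplexity of one response from its per-token log-probabilities."""
    return float(np.exp(-np.mean(token_logprobs)))

def calibration_auc(perplexities, correct):
    """AUROC of low perplexity as a predictor of correctness.

    Returns P(ppl_correct < ppl_incorrect) over all correct/incorrect
    pairs, counting ties as 0.5. A value of 1.0 means correct responses
    are always more confident (lower perplexity); 0.5 is chance; below
    0.5 is the failure mode the abstract attributes to GRPO, where
    incorrect responses look more confident than correct ones.
    """
    ppl = np.asarray(perplexities, dtype=float)
    y = np.asarray(correct, dtype=bool)
    pos, neg = ppl[y], ppl[~y]  # correct vs. incorrect responses
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need both correct and incorrect responses")
    # Pairwise ranking: a correct response should have lower perplexity.
    wins = (pos[:, None] < neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example of the degenerate case: incorrect answers with lower
# perplexity than correct ones push the AUC to 0 (fully inverted).
ppls = [1.8, 2.1, 1.2, 1.3]          # hypothetical response perplexities
labels = [True, True, False, False]  # correctness of each response
print(calibration_auc(ppls, labels))  # -> 0.0
```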