UCPO: Uncertainty-Aware Policy Optimization

📰 ArXiv cs.AI

arXiv:2601.22648v2

Abstract: The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, thereby mitigating overconfident errors in high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the

Published 27 May 2026