JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

📰 ArXiv cs.AI

arXiv:2604.25419v1. Abstract: Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that deco

Published 29 Apr 2026
