JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
📰 ArXiv cs.AI
arXiv:2604.25419v1. Abstract: Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that deco
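
To make the false-positive concern concrete, here is a minimal sketch of the majority-voting baseline the abstract mentions, not of JURY-RL itself. The function name `majority_vote_rewards` and the sample rollouts are hypothetical, chosen only for illustration; the sketch assumes each rollout has already been reduced to a final answer string.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Label-free pseudo-reward: 1.0 for rollouts that match the most
    common answer in the group, 0.0 for the rest.

    Illustrative assumption: answers are already normalized strings.
    """
    if not answers:
        return None, []
    # most_common(1) returns [(answer, count)] for the top answer.
    consensus, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards


# Hypothetical rollouts for one prompt whose true answer is "42".
rollouts = ["41", "41", "42", "41", "7"]
consensus, rewards = majority_vote_rewards(rollouts)
print(consensus, rewards)
```

In this made-up group the consensus is the wrong answer, so three incorrect rollouts are rewarded and the one correct rollout is penalized. That is the kind of false positive the abstract says can destabilize training.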