Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks
📰 ArXiv cs.AI
Optimize LLM reasoning on unverifiable tasks with Direct Reasoning Optimization, which combines a token-level dense Reasoning Reflection Reward (R3) with rubric-gating applied at the rollout group level.
Action Steps
- Define a token-level dense Reasoning Reflection Reward (R3) to measure reasoning quality
- Implement rubric-gating as feasibility constraints at the rollout group level
- Optimize R3 using reinforcement learning (RL) training
- Enforce rubric-gating constraints during RL training
- Evaluate the performance of the optimized model on unverifiable tasks
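One way to picture the reward and gating steps above is the minimal Python sketch that follows. It is an illustration under stated assumptions, not the paper's implementation. The per-token scores stand in for R3, whose exact definition is not given here. The mean aggregation, the group-relative normalization, the fixed penalty for rollouts that fail the rubric, and the names rollout_reward and gated_group_advantages are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def rollout_reward(token_scores: List[float]) -> float:
    """Collapse per-token scores into one scalar per rollout.

    ASSUMPTION: the scores are a placeholder for a token-level dense
    reward such as R3, and averaging them is an illustrative choice.
    """
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    return mean(token_scores)


def gated_group_advantages(
    group_token_scores: List[List[float]],
    passes_rubric: List[bool],
    infeasible_advantage: float = -1.0,
    eps: float = 1e-8,
) -> List[float]:
    """Compute per-rollout advantages for one rollout group.

    ASSUMPTION: the rubric gate is modelled as a boolean per rollout.
    Rollouts that pass are normalized against the other passing rollouts
    in the same group; rollouts that fail receive a fixed penalty.
    """
    if len(group_token_scores) != len(passes_rubric):
        raise ValueError("one rubric flag is required per rollout")

    rewards = [rollout_reward(scores) for scores in group_token_scores]
    feasible = [r for r, ok in zip(rewards, passes_rubric) if ok]

    # If no rollout satisfies the rubric, the group carries no usable
    # signal in this sketch, so every advantage is set to zero.
    if not feasible:
        return [0.0] * len(rewards)

    mu = mean(feasible)
    sigma = pstdev(feasible)  # population std; 0.0 for a single rollout

    return [
        (r - mu) / (sigma + eps) if ok else infeasible_advantage
        for r, ok in zip(rewards, passes_rubric)
    ]


if __name__ == "__main__":
    # Three rollouts for one prompt; the third fails the rubric gate.
    scores = [[0.2, 0.4], [0.6, 0.8], [0.9, 0.9]]
    flags = [True, True, False]
    print(gated_group_advantages(scores, flags))
```

In this sketch the third rollout has the highest raw reward but is still penalized, because the gate is treated as a feasibility constraint rather than as one more term in the reward. An actual training loop would feed these advantages into a policy-gradient update, which is omitted here.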
Who Needs to Know This
NLP engineers and researchers can use this technique to improve the reasoning quality of their LLMs, especially on unverifiable tasks.
Key Insight
💡 Token-level reasoning optimization with rubric-gating can improve LLM performance on unverifiable tasks
Full Article
Abstract:
arXiv:2506.13351v3. Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level
DeepCamp AI