Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

📰 arXiv cs.AI

Optimize token-level reasoning in LLMs with Direct Reasoning Optimization, which combines a dense Reasoning Reflection Reward (R3) with rubric-gating for unverifiable tasks.

advanced Published 11 May 2026

Action Steps

Define a token-level dense Reasoning Reflection Reward (R3) to measure reasoning quality
Implement rubric-gating as feasibility constraints at the rollout group level
Optimize R3 using reinforcement learning (RL) training
Enforce rubric-gating constraints during RL training
Evaluate the performance of the optimized model on unverifiable tasks
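
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Rollout` structure, averaging token rewards into one scalar, and giving rubric-failing rollouts zero advantage are all hypothetical choices made here for clarity. How R3 itself is computed is not covered by this summary, so the token rewards are plain placeholder numbers.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    """One sampled response in a rollout group (hypothetical structure)."""
    token_rewards: List[float]  # placeholder per-token rewards standing in for R3
    passes_rubric: bool         # placeholder rubric verdict for this rollout


def sequence_reward(rollout: Rollout) -> float:
    # Assumption: collapse dense token rewards into one scalar by averaging.
    return mean(rollout.token_rewards) if rollout.token_rewards else 0.0


def gated_group_advantages(group: List[Rollout], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages with a rubric gate (one possible reading).

    Rollouts that fail the rubric are treated as infeasible: they are left out
    of the group baseline and receive an advantage of 0.0.
    """
    rewards = [sequence_reward(r) for r in group]
    feasible = [rw for rw, r in zip(rewards, group) if r.passes_rubric]
    if len(feasible) < 2:
        # Too few feasible rollouts to form a group baseline: no learning signal.
        return [0.0] * len(group)
    mu = mean(feasible)
    sigma = pstdev(feasible)
    return [
        (rw - mu) / (sigma + eps) if r.passes_rubric else 0.0
        for rw, r in zip(rewards, group)
    ]


group = [
    Rollout([1.0, 1.0], True),
    Rollout([0.0, 0.0], True),
    Rollout([5.0], False),  # high reward, but gated out by the rubric
]
print(gated_group_advantages(group))  # roughly [1.0, -1.0, 0.0]
```

In this sketch the third rollout has the highest raw reward but contributes nothing, which shows the intent of a gate: the rubric acts as a feasibility check applied before rewards are compared within the group, rather than as one more term added to the reward.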

Who Needs to Know This

NLP engineers and researchers who train LLMs with RL can use this technique to improve reasoning quality, especially on unverifiable tasks.

Key Insight

💡 Token-level reasoning optimization with rubric-gating can improve LLM performance on unverifiable tasks

Full Article

Abstract:
arXiv:2506.13351v3

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level
