Structured Role-Aware Policy Optimization for Multimodal Reasoning

📰 ArXiv cs.AI

arXiv:2605.07274v1

Abstract: Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is sup
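To make the sequence-level credit assignment concrete, here is a minimal sketch of the standard GRPO group-relative advantage, which is the baseline the abstract describes and not the paper's proposed method. The function name `grpo_token_advantages`, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, token_counts, eps=1e-6):
    """Broadcast one group-normalized advantage to every token of each response.

    rewards:      one scalar final-answer reward per sampled response in the group
    token_counts: number of generated tokens in each response
    eps:          small constant to avoid division by zero (assumed value)
    """
    if len(rewards) != len(token_counts):
        raise ValueError("rewards and token_counts must have the same length")

    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here

    advantages = []
    for reward, n_tokens in zip(rewards, token_counts):
        # The advantage depends only on the sequence-level reward,
        # so every token in the response receives the same value.
        a = (reward - mu) / (sigma + eps)
        advantages.append([a] * n_tokens)
    return advantages


# Hypothetical group of four sampled responses: two correct, two incorrect.
group_advantages = grpo_token_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
for seq in group_advantages:
    print(seq)
```

Because every token in a response shares one advantage, a perception token and a reasoning token in the same correct answer are reinforced identically, which is the limitation the abstract points to.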

Published 11 May 2026
