Structured Role-Aware Policy Optimization for Multimodal Reasoning
📰 ArXiv cs.AI
arXiv:2605.07274v1 (announce type: new)
Abstract: Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is sup
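To make the sequence-level credit assignment concrete, here is a minimal sketch of the standard group-relative advantage used in GRPO-style training. It is not the paper's method. The function names, the reward values, the token counts, and the choice of population standard deviation with a small `eps` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's scalar reward against its sampling group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Copy each response-level advantage onto every token of that response."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of four sampled responses to one prompt:
# 1.0 means the final answer was verified correct, 0.0 means incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [5, 3, 4, 2]  # hypothetical response lengths in tokens

advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)

for adv, tokens in zip(advantages, per_token):
    print(f"advantage={adv:+.3f} applied to {len(tokens)} tokens")
```

Every token in a response receives the same advantage, whatever part it played in the reasoning. This is the limitation the abstract points to: a sequence-level reward cannot tell tokens with different functional roles apart.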