Visual Preference Optimization with Rubric Rewards
📰 ArXiv cs.AI
arXiv:2604.13029v1 Announce Type: cross

Abstract: The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-
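For context on the objective the abstract builds on: the standard DPO loss scores a chosen/rejected response pair by the gap in policy-vs-reference log-probability ratios. This is a minimal sketch of that standard loss, not the paper's rDPO variant; the function name, arguments, and `beta` default are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (sketch, not the paper's rDPO).

    logp_* are summed token log-probs of the chosen/rejected responses under the
    policy being trained; ref_logp_* are the same quantities under the frozen
    reference model. beta scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Loss is -log sigmoid(margin); the two branches compute it in a
    # numerically stable way for positive and negative margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy assigns no extra mass to the chosen response (zero margin), the loss is log 2; it shrinks toward zero as the policy's preference for the chosen response over the reference grows.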