Visual Preference Optimization with Rubric Rewards
📰 ArXiv cs.AI
arXiv:2604.13029v1 Announce Type: cross

Abstract: The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-
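For context on the objective the abstract builds on: the standard DPO loss scores a chosen/rejected response pair by the gap in policy-vs-reference log-probability ratios. This is a minimal sketch of that standard loss, not the paper's rDPO variant; the function name, arguments, and `beta` default are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (sketch, not the paper's rDPO).

    logp_* are summed token log-probs of the chosen/rejected responses under the
    policy being trained; ref_logp_* are the same quantities under the frozen
    reference model. beta scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Loss is -log sigmoid(margin); the two branches compute it in a
    # numerically stable way for positive and negative margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy assigns no extra mass to the chosen response (zero margin), the loss is log 2; it shrinks toward zero as the policy's preference for the chosen response over the reference grows.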