Listener-Rewarded Thinking in VLMs for Image Preferences
📰 ArXiv cs.AI
arXiv:2506.22832v3 Announce Type: replace-cross

Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we unco
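For context on the GRPO method the abstract references, here is a minimal sketch of its core idea, the group-relative advantage: each response sampled for a prompt is scored by a reward model, and its advantage is its reward z-scored within the group. This is an illustration of the general technique only, not the paper's implementation; the function name and example rewards are invented for this sketch.

```python
# Sketch of GRPO's group-relative advantage (illustrative, not the
# paper's code): advantages are rewards normalized within a group of
# responses sampled for the same prompt.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against its group's mean and population std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for four sampled responses.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because advantages are computed relative to the group rather than a learned value baseline, GRPO needs no separate critic network, which is part of its appeal for reward-model training.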