Listener-Rewarded Thinking in VLMs for Image Preferences
📰 ArXiv cs.AI
arXiv:2506.22832v3 Announce Type: replace-cross

Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we unco
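For context on the GRPO method the abstract references, here is a minimal sketch of its core idea, the group-relative advantage: each response sampled for a prompt is scored by a reward model, and its advantage is its reward z-scored within the group. This is an illustration of the general technique only, not the paper's implementation; the function name and example rewards are invented for this sketch.

```python
# Sketch of GRPO's group-relative advantage (illustrative, not the
# paper's code): advantages are rewards normalized within a group of
# responses sampled for the same prompt.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against its group's mean and population std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for four sampled responses.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because advantages are computed relative to the group rather than a learned value baseline, GRPO needs no separate critic network, which is part of its appeal for reward-model training.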