Efficient Exploration for Iterative Nash Preference Optimization
📰 ArXiv cs.AI
arXiv:2606.01382v1 (announce type: cross)
Abstract: Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human preferences are cyclic, non-transitive, or otherwise not representable by a scalar reward. Nash Learning from Human Feedback (NLHF) addresses this limitation by modeling alignment as a preference game and targeting a Nash equilibrium rather than a reward maximizer. However, the learning-theoretic foundations of