Efficient Exploration for Iterative Nash Preference Optimization

📰 ArXiv cs.AI

arXiv:2606.01382v1 (announce type: cross)

Abstract: Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human preferences are cyclic, non-transitive, or otherwise not representable by a scalar reward. Nash Learning from Human Feedback (NLHF) addresses this limitation by modeling alignment as a preference game and targeting a Nash equilibrium rather than a reward maximizer. However, the learning-theoretic foundations of

Published 2 Jun 2026
