HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

📰 ArXiv cs.AI

arXiv:2604.20140v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes the likelihood of generating preferred over dispreferred responses in their entirety, so it lacks the granularity to provide feedback on subsections of the many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., D
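For context, the whole-response DPO objective the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's method: the function name, the β value, and the use of summed sequence log-probabilities are assumptions based on the standard DPO formulation, which scores the entire preferred and dispreferred responses rather than individual reasoning steps.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Sequence-level DPO loss for one preference pair.

    Each argument is the total log-probability of a full response
    under the policy or the frozen reference model -- note there is
    no per-step term, which is the granularity gap the abstract
    points out for multi-step reasoning.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Loss is -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is log 2; increasing the policy's preference for the chosen response over the rejected one drives the loss toward zero.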

Published 23 Apr 2026