Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs

📰 arXiv cs.AI

Modern LLMs' strong priors limit the effectiveness of post-training data-selection strategies, making Random sampling a strong baseline in online Direct Preference Optimization (DPO).

Published 6 Apr 2026
Action Steps
  1. Evaluate uncertainty-based Active Preference Learning (APL) against Random sampling in online DPO (see the selection sketch after this list)
  2. Assess how modern LLMs' strong priors blunt post-training data-selection strategies
  3. Weigh the trade-off between query efficiency and the richness of on-policy candidate pools
  4. Identify the conditions under which Random sampling matches or outperforms APL in online DPO
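To make step 1 concrete, here is a minimal sketch of the two selection strategies being compared. It is illustrative only: `CandidatePair`, `select_random`, `select_uncertain`, and the use of the DPO implicit reward margin as an uncertainty score are assumptions made for this sketch, not the paper's implementation; a real pipeline would compute margins from policy and reference log-probabilities over on-policy samples.

```python
import random
from dataclasses import dataclass

@dataclass
class CandidatePair:
    """A candidate preference pair from the on-policy pool (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str
    margin: float  # assumed: DPO implicit reward margin, beta * (log-ratio chosen - log-ratio rejected)

def select_random(pool: list[CandidatePair], k: int) -> list[CandidatePair]:
    """Random baseline: sample k pairs uniformly from the candidate pool."""
    return random.sample(pool, k)

def select_uncertain(pool: list[CandidatePair], k: int) -> list[CandidatePair]:
    """Uncertainty-based APL heuristic: keep the k pairs whose implicit reward
    margin is closest to zero, i.e. where the policy is least sure which
    response wins the preference comparison."""
    return sorted(pool, key=lambda p: abs(p.margin))[:k]

# Hypothetical pool with precomputed margins, just to exercise both selectors.
pool = [CandidatePair(f"q{i}", "resp_a", "resp_b", random.uniform(-2.0, 2.0))
        for i in range(100)]
batch_random = select_random(pool, k=8)
batch_apl = select_uncertain(pool, k=8)
print([round(p.margin, 2) for p in batch_apl])  # margins near zero
```

The uncertainty rule here is the standard logistic-uncertainty heuristic in APL; the paper's point is that with modern LLMs' strong priors, this extra selection machinery often fails to beat the uniform-random baseline.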
Who Needs to Know This

Machine learning researchers and engineers working on LLM post-training and online DPO benefit from understanding the limits of active selection strategies and why Random sampling can be a surprisingly effective approach.

Key Insight

💡 Modern LLMs' strong priors can limit the effectiveness of active selection strategies, making Random sampling a hard-to-beat baseline in online DPO.

Share This
🤖 Random sampling can be a strong baseline in online DPO with modern LLMs! 📊