Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs

📰 arXiv cs.AI

Modern LLMs' strong priors limit the effectiveness of post-training data-selection strategies, making Random sampling a strong baseline in online Direct Preference Optimization (DPO).

Published 6 Apr 2026
Action Steps
  1. Evaluate uncertainty-based Active Preference Learning (APL) against Random sampling in online DPO (see the selection sketch after this list)
  2. Assess how modern LLMs' strong priors blunt post-training data-selection strategies
  3. Weigh the trade-off between query efficiency and the richness of on-policy candidate pools
  4. Identify the conditions under which Random sampling matches or outperforms APL in online DPO
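To make step 1 concrete, here is a minimal sketch of the two selection strategies being compared. It is illustrative only: `CandidatePair`, `select_random`, `select_uncertain`, and the use of the DPO implicit reward margin as an uncertainty score are assumptions made for this sketch, not the paper's implementation; a real pipeline would compute margins from policy and reference log-probabilities over on-policy samples.

```python
import random
from dataclasses import dataclass

@dataclass
class CandidatePair:
    """A candidate preference pair from the on-policy pool (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str
    margin: float  # assumed: DPO implicit reward margin, beta * (log-ratio chosen - log-ratio rejected)

def select_random(pool: list[CandidatePair], k: int) -> list[CandidatePair]:
    """Random baseline: sample k pairs uniformly from the candidate pool."""
    return random.sample(pool, k)

def select_uncertain(pool: list[CandidatePair], k: int) -> list[CandidatePair]:
    """Uncertainty-based APL heuristic: keep the k pairs whose implicit reward
    margin is closest to zero, i.e. where the policy is least sure which
    response wins the preference comparison."""
    return sorted(pool, key=lambda p: abs(p.margin))[:k]

# Hypothetical pool with precomputed margins, just to exercise both selectors.
pool = [CandidatePair(f"q{i}", "resp_a", "resp_b", random.uniform(-2.0, 2.0))
        for i in range(100)]
batch_random = select_random(pool, k=8)
batch_apl = select_uncertain(pool, k=8)
print([round(p.margin, 2) for p in batch_apl])  # margins near zero
```

The uncertainty rule here is the standard logistic-uncertainty heuristic in APL; the paper's point is that with modern LLMs' strong priors, this extra selection machinery often fails to beat the uniform-random baseline.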
Who Needs to Know This

Machine learning researchers and engineers working on LLM post-training and online DPO benefit from understanding the limits of active selection strategies and why Random sampling can be a surprisingly effective approach.

Key Insight

💡 Modern LLMs' strong priors can limit the effectiveness of active selection strategies, making Random sampling a hard-to-beat baseline in online DPO.

Share This
🤖 Random sampling can be a strong baseline in online DPO with modern LLMs! 📊