Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs
📰 ArXiv cs.AI
Modern LLMs' strong priors limit the effectiveness of post-training data-selection strategies, making Random sampling a strong baseline in online Direct Preference Optimization (DPO)
Action Steps
- Evaluate the performance of uncertainty-based Active Preference Learning (APL) against Random sampling in online DPO
- Assess the impact of modern LLMs' strong priors on post-training data-selection strategies
- Consider the trade-offs between query efficiency and the richness of on-policy candidate pools
- Investigate the conditions under which Random sampling outperforms APL in online DPO
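To make the first action step concrete, the sketch below contrasts uncertainty-based selection against the Random baseline on a toy candidate pool. It scores each preference pair by the DPO implicit reward margin and queries the pair the policy is least decided about. All log-probabilities, the `BETA` value, and the pool itself are invented for illustration; this is a minimal sketch of the selection logic, not the paper's implementation.

```python
import random

# Hypothetical candidate pool: each entry holds the summed log-probs of the
# preferred response (y_w) and the dispreferred one (y_l) under the current
# policy (pi_*) and the frozen reference model (ref_*). Numbers are invented.
candidates = [
    {"id": 0, "pi_w": -12.1, "pi_l": -12.3, "ref_w": -12.0, "ref_l": -12.2},
    {"id": 1, "pi_w": -8.4,  "pi_l": -15.9, "ref_w": -9.0,  "ref_l": -14.5},
    {"id": 2, "pi_w": -10.2, "pi_l": -10.4, "ref_w": -10.3, "ref_l": -10.1},
]

BETA = 0.1  # DPO temperature (illustrative value)

def implicit_margin(c):
    # DPO implicit reward margin:
    # beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    return BETA * ((c["pi_w"] - c["ref_w"]) - (c["pi_l"] - c["ref_l"]))

def select_uncertain(pool, k=1):
    # Uncertainty-based APL: query pairs whose margin is closest to zero,
    # i.e. where the policy is least decided between the two responses.
    return sorted(pool, key=lambda c: abs(implicit_margin(c)))[:k]

def select_random(pool, k=1, seed=0):
    # Random baseline: ignore the margins entirely.
    return random.Random(seed).sample(pool, k)

print([c["id"] for c in select_uncertain(candidates)])  # pair with near-zero margin
print([c["id"] for c in select_random(candidates)])     # an arbitrary pair
```

The paper's finding suggests that with strong-prior models, the extra machinery in `select_uncertain` often fails to beat `select_random`, so the baseline deserves a serious head-to-head comparison before adopting active selection.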
Who Needs to Know This
Machine learning researchers and engineers working on LLMs and online DPO can benefit from understanding the limitations of active selection strategies and why Random sampling can be a surprisingly effective approach
Key Insight
💡 Modern LLMs' strong priors can limit the effectiveness of active selection strategies, making Random sampling a surprisingly effective approach
Share This
🤖 Random sampling can be a strong baseline in online DPO with modern LLMs! 📊
DeepCamp AI