An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

📰 arXiv cs.AI

Empirical study on SFT-DPO interaction and parameterization in small language models shows DPO yields small task-dependent gains

Advanced · Published 23 Mar 2026
Action Steps
  1. Compare SFT-only, DPO-only, and staged SFT-to-DPO training pipelines (see the sketch after this list)
  2. Evaluate each pipeline on the study's tasks, such as paraphrase detection and Shakespearean sonnet continuation
  3. Compare full fine-tuning (FFT) and LoRA as parameterizations of the SFT and DPO stages, rather than as alternatives to them
  4. Analyze the results to determine the most effective approach for small language models
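The staged pipeline in step 1 can be prototyped with off-the-shelf tooling. Below is a minimal sketch using Hugging Face TRL's recent API; the backbone and dataset names are placeholders, not the paper's actual setup. Skipping stage 1 gives the DPO-only baseline, skipping stage 2 gives SFT-only, and passing a peft LoraConfig switches the parameterization from FFT to LoRA.

```python
# Minimal sketch of staged SFT -> DPO with Hugging Face TRL (recent API).
# Backbone and dataset names below are placeholders, not the paper's setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

BACKBONE = "Qwen/Qwen2.5-0.5B"  # any small causal LM

# Stage 1: supervised fine-tuning on instruction data (full fine-tuning here;
# pass peft_config=LoraConfig(...) from peft to use the LoRA parameterization).
sft_trainer = SFTTrainer(
    model=BACKBONE,
    args=SFTConfig(output_dir="sft-ckpt", num_train_epochs=1),
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
sft_trainer.train()
sft_trainer.save_model("sft-ckpt")

# Stage 2: DPO on preference pairs, initialized from the SFT checkpoint.
# TRL clones the checkpoint as the frozen reference policy by default.
dpo_trainer = DPOTrainer(
    model="sft-ckpt",
    args=DPOConfig(output_dir="sft-dpo-ckpt", beta=0.1, num_train_epochs=1),
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
dpo_trainer.train()
```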
Who Needs to Know This

ML researchers and engineers fine-tuning language models, especially those working with small backbones and modest data budgets, can use these findings to choose between SFT-only, DPO-only, and staged SFT-to-DPO pipelines.

Key Insight

💡 DPO can provide small improvements in language model performance, but the gains are task-dependent and vary with model architecture and parameterization
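For context, the standard DPO objective (Rafailov et al., 2023) trains the policy directly on preference pairs, with no separate reward model:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference policy (typically the SFT checkpoint in the staged setup), $\sigma$ is the logistic sigmoid, and $\beta$ scales the implicit KL penalty against the reference.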

Share This
🤖 DPO yields small task-dependent gains in small language models 📊