Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

Umar Jamil · Beginner ·📄 Research Papers Explained ·2y ago


In this video I explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".

I start by introducing language models and how they are used for text generation. After briefly introducing the topic of AI alignment, I review Reinforcement Learning (RL), a topic that is necessary to understand the reward model and its loss function. I derive step by step the loss function of the reward model under the Bradley-Terry model of preferences, a derivation that is missing in the DPO paper. Using the Bradley-Terry model, I build the loss of the DPO algorithm, explaining its mathematical derivation and giving intuition on how it works. In the last part, I describe how to use the loss in practice, that is, how to calculate the log probabilities using a Transformer model, by showing how it is implemented in the Hugging Face library.

DPO paper: Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S. and Finn, C., 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2305.18290

If you're interested in how to derive the optimal solution to the RL constrained optimization problem, I highly recommend the following paper (Appendix A, equation 36): Peng, X.B., Kumar, A., Zhang, G. and Levine, S., 2019. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177. https://arxiv.org/abs/1910.00177

Slides PDF: https://github.com/hkproj/dpo-notes

Chapters
00:00:00 - Introduction
00:02:10 - Intro to Language Models
00:04:08 - AI Alignment
00:05:11 - Intro to RL
00:08:19 - RL for Language Models
00:10:44 - Reward model
00:13:07 - The Bradley-Terry model
00:21:34 - Optimization Objective
00:29:52 - DPO: deriving its loss
00:41:05 - Computing the log probabilities
00:47:
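The three pieces named in the description (the Bradley-Terry reward-model loss, the DPO loss built on it, and the log probability of a response under a model) can be sketched in plain Python. This is a minimal illustration and not the Hugging Face implementation: all function names are hypothetical, `beta=0.1` is an assumed illustrative value, and the logits are assumed to be already aligned so that `logits[t]` is the distribution that predicts `token_ids[t]`.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(logits, token_ids, response_mask):
    # Sum the log-probabilities of the response tokens only.
    # Positions where response_mask is 0 (prompt or padding) are skipped.
    total = 0.0
    for step_logits, token, keep in zip(logits, token_ids, response_mask):
        if keep:
            total += log_softmax(step_logits)[token]
    return total

def log_sigmoid(x):
    # Stable log(sigmoid(x)) for both signs of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bradley_terry_loss(reward_chosen, reward_rejected):
    # Negative log-likelihood that the chosen answer beats the rejected one:
    # -log sigmoid(r_chosen - r_rejected).
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # DPO plugs the implicit reward beta * (log pi - log pi_ref)
    # into the Bradley-Terry loss, so no separate reward model is needed.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    return bradley_terry_loss(chosen_reward, rejected_reward)
```

When the policy and the reference model assign identical log probabilities, both implicit rewards are zero and the loss equals log 2. Raising the log probability of the chosen answer relative to the reference, or lowering that of the rejected one, pushes the loss below that value.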


