Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
In this video I will explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
I start by introducing language models and how they are used for text generation. After a brief overview of AI alignment, I review Reinforcement Learning (RL), background that is necessary to understand the reward model and its loss function.
I then derive, step by step, the loss function of the reward model under the Bradley-Terry model of preferences, a derivation …
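For reference (notation as in the DPO paper), the two results the derivation builds up to are the reward-model loss under the Bradley-Terry model and the DPO loss obtained by reparameterizing the reward in terms of the policy:

$$
\mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
$$

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the preferred and rejected completions for a prompt $x$, $\sigma$ is the sigmoid function, and $\beta$ controls the strength of the implicit KL constraint that keeps the policy $\pi_\theta$ close to the reference policy $\pi_{\mathrm{ref}}$.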
Watch on YouTube ↗
Chapters (10)
0:00 - Introduction
2:10 - Intro to Language Models
4:08 - AI Alignment
5:11 - Intro to RL
8:19 - RL for Language Models
10:44 - Reward model
13:07 - The Bradley-Terry model
21:34 - Optimization Objective
29:52 - DPO: deriving its loss
41:05 - Computing the log probabilities
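The last chapter, "Computing the log probabilities", covers how the log probabilities that appear inside the DPO loss are obtained in practice. As a rough companion to that chapter (not code from the video), here is a minimal PyTorch-style sketch; it assumes HuggingFace-style logits and labels, a mask that selects only the response tokens, and hypothetical helper names:

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, labels, response_mask):
    # logits: (batch, seq_len, vocab); labels, response_mask: (batch, seq_len).
    # Shift so that the logits at position t-1 predict the token at position t.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = response_mask[:, 1:].to(logits.dtype)
    # Log probability of each actual next token under the model.
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = torch.gather(log_probs, dim=2, index=labels.unsqueeze(2)).squeeze(2)
    # Sum over the response tokens only: log pi(y | x).
    return (token_log_probs * mask).sum(dim=-1)

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are beta * log(pi_theta / pi_ref) for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO loss: -log sigmoid(reward_chosen - reward_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Each of the four log-probability inputs is the summed log probability of a full completion (preferred or rejected) under either the trained policy or the frozen reference model.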