Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

Umar Jamil · Beginner · 📄 Research Papers Explained · 1y ago
In this video I explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". I start by introducing language models and how they are used for text generation. After a brief overview of AI alignment, I review Reinforcement Learning (RL), background that is necessary to understand the reward model and its loss function. I then derive step by step the loss function of the reward model under the Bradley-Terry model of preferences, a derivation …
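For reference while watching, these are the two central formulas from the paper (written here in the paper's notation, not reproduced from the video's slides): the Bradley-Terry preference probability, on which the reward-model loss is built, and the DPO loss obtained by substituting the policy's implicit reward into that objective.

$$P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)$$

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here $y_w$ and $y_l$ are the preferred and dispreferred responses, $\sigma$ is the logistic function, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ controls how far the trained policy $\pi_\theta$ may drift from it.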
Watch on YouTube ↗

Chapters (10)

0:00 Introduction
2:10 Intro to Language Models
4:08 AI Alignment
5:11 Intro to RL
8:19 RL for Language Models
10:44 Reward model
13:07 The Bradley-Terry model
21:34 Optimization Objective
29:52 DPO: deriving its loss
41:05 Computing the log probabilities
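The final chapter covers computing the log probabilities $\log \pi(y \mid x)$ that the DPO loss needs for the chosen and rejected responses. Below is a minimal sketch of that step (not the code shown in the video), assuming a Hugging Face-style causal LM whose logits have shape (batch, seq_len, vocab) and label tensors in which the prompt tokens are masked with -100:

```python
# Minimal sketch: sum of per-token log-probabilities of a response under a
# causal LM, i.e. log pi(y|x). Assumes prompt positions are labeled -100.
import torch
import torch.nn.functional as F

def sequence_log_prob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so that the token at position t is predicted from positions < t.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = labels != -100                       # keep only response tokens
    labels = labels.clamp(min=0)                # make masked ids valid for gather
    log_probs = F.log_softmax(logits, dim=-1)   # per-token log-probabilities
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_log_probs * mask).sum(-1)     # (batch,) sum over response tokens
```

These sums would be computed for both the policy and the reference model on the chosen and rejected responses, and their differences plugged into the DPO loss above.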