Reinforcement Learning, RLHF, & DPO Explained
Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game.
This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.
Chapters (17)
0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters