Reinforcement Learning, RLHF, & DPO Explained

Mark Hennings · Advanced · 📄 Research Papers Explained · 1y ago
Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game. This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.
Watch on YouTube ↗

Chapters (17)

0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters