Reinforcement Learning, RLHF, & DPO Explained
Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game.
This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.
0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters
These are the three papers referenced in the video:
1. Deep reinforcement learning from human preferences (https://arxiv.org/abs/1706.03741)
2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)
3. KTO: Model Alignment as Prospect Theoretic Optimization (https://arxiv.org/abs/2402.01306)
The Hugging Face TRL library offers implementations of PPO, DPO, and KTO:
https://huggingface.co/docs/trl/main/en/kto_trainer
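To make the difference between the preferences dataset (DPO) and the KTO dataset concrete, here is a minimal sketch in plain Python. It assumes the column names described in the TRL docs (prompt/chosen/rejected for paired preferences, prompt/completion/label for KTO). The example rows and the paired_to_unpaired helper are made up for illustration and are not part of TRL.

```python
# Illustrative only: the rows below are invented, and paired_to_unpaired is a
# hypothetical helper, not a TRL function. Column names are assumed to match
# the dataset formats described in the TRL documentation.

def paired_to_unpaired(preference_rows):
    """Split each paired preference row into two thumbs-up/thumbs-down rows."""
    unpaired = []
    for row in preference_rows:
        # The preferred answer becomes a desirable example (label True).
        unpaired.append(
            {"prompt": row["prompt"], "completion": row["chosen"], "label": True}
        )
        # The other answer becomes an undesirable example (label False).
        unpaired.append(
            {"prompt": row["prompt"], "completion": row["rejected"], "label": False}
        )
    return unpaired


# DPO-style data: each row needs BOTH a preferred and a non-preferred answer.
dpo_rows = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
]

# KTO-style data: each row is a single answer plus a good/bad label.
kto_rows = paired_to_unpaired(dpo_rows)

for row in kto_rows:
    print(row)
```

The conversion only goes one way without extra work: KTO data does not need a matching pair for every prompt, so a set of individually rated answers cannot in general be turned back into DPO rows.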
Want to prototype with prompts and supervised fine-tuning? Try Entry Point AI:
https://www.entrypointai.com/
How about connecting? I'm on LinkedIn:
https://www.linkedin.com/in/markhennings/