Reinforcement Learning, RLHF, & DPO Explained

Mark Hennings · Advanced · Research Papers Explained · 1y ago

Skills: Research Methods 90%, Reading ML Papers 90%, LLM Foundations 80%, Paper Reproduction 80%

Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game. This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.

0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters

These are the three papers referenced in the video:

1. Deep reinforcement learning from human preferences (https://arxiv.org/abs/1706.03741)
2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)
3. KTO: Model Alignment as Prospect Theoretic Optimization (https://arxiv.org/abs/2402.01306)

The Hugging Face TRL library offers implementations for PPO, DPO, and KTO: https://huggingface.co/docs/trl/main/en/kto_trainer

Want to prototype with prompts and supervised fine-tuning? Try Entry Point AI: https://www.entrypointai.com/

How about connecting? I'm on LinkedIn: https://www.linkedin.com/in/markhennings/
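The video covers a preferences dataset for DPO and a separate dataset format for KTO. As a rough illustration only, the sketch below shows what a paired preference record might look like, how it could be split into binary-labelled examples of the kind KTO works with, and the per-pair DPO loss as defined in the DPO paper. The record contents, the field names (`prompt`, `chosen`, `rejected`, `completion`, `label`), the helper names, and the default `beta=0.1` are assumptions chosen for this example, not taken from the video. The log-probabilities are assumed to be summed over the tokens of each completion.

```python
import math

# Hypothetical record for illustration; the text is made up.
preference_pair = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}


def pair_to_binary_examples(pair):
    """Split one paired preference into two unpaired examples with a
    desirable/undesirable label (the binary signal KTO trains on)."""
    return [
        {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True},
        {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False},
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single pair: -log(sigmoid(beta * margin)).

    Each argument is the log-probability of a whole completion under the
    policy being trained or under the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # when the margin is large in magnitude.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    print(pair_to_binary_examples(preference_pair))
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy favours the chosen completion more than the reference does.
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen completion's likelihood relative to the reference and lowers the rejected one's, which is how DPO gets by without a separately trained reward model. KTO's actual loss uses a prospect-theory-inspired value function rather than this pairwise form, so only its data format is sketched here.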




