GRPO - Group Relative Policy Optimization - How DeepSeek trains reasoning models
GRPO is the algorithm DeepSeek used to train its reasoning model. The biggest innovation is using reinforcement learning to get the model to improve itself, as opposed to self-supervised learning. Learn all about it in this friendly video!
Other videos in RL for LLMs:
Deep Reinforcement Learning: https://www.youtube.com/watch?v=SgC6AZss478
Reinforcement Learning with Human Feedback (RLHF): https://www.youtube.com/watch?v=Z_JUqJBpVOk
Proximal Policy Optimization (PPO): https://www.youtube.com/watch?v=TjHH_--7l8g
Direct Preference Optimization (DPO): https://www.youtube.com/watch?v=k2pD…
Chapters (9)
Introduction (0:26)
Answering with context (1:40)
DeepSeek vs ChatGPT (5:30)
The GRPO score (7:05)
Averaging over answers and steps (7:38)
Quality (Advantage) (10:30)
Probability of responses (15:36)
Clipping the response (18:21)
Not changing the model too much
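The chapters above walk through the pieces of the GRPO score: the advantage of each answer relative to its group, the probability ratio of responses, and clipping so the model doesn't change too much. Here is a minimal sketch of those two pieces in plain Python. The function names and the reward values are hypothetical, chosen just for illustration; a real implementation would work on per-token log-probabilities from the model.

```python
import statistics

def group_advantages(rewards):
    # GRPO's key idea: sample a group of answers to the same question and
    # score each answer relative to the group, with no separate value network.
    # Advantage = (reward - group mean) / group standard deviation.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    # PPO-style clipping, reused by GRPO: keep the probability ratio
    # (new policy / old policy) inside [1 - eps, 1 + eps], then take the
    # more pessimistic of the clipped and unclipped terms, so a single
    # update cannot change the model too much.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Hypothetical group of 4 answers where two earned reward 1 and two earned 0:
advs = group_advantages([1.0, 0.0, 0.0, 1.0])  # → [1.0, -1.0, -1.0, 1.0]
```

Note how `clipped_objective(1.5, 1.0)` returns 1.2 (the gain from a good answer is capped), while `clipped_objective(1.5, -1.0)` returns -1.5 (the penalty for a bad answer is not softened); this asymmetry is what keeps updates conservative.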
DeepCamp AI