GRPO - Group Relative Policy Optimization - How DeepSeek trains reasoning models

Serrano.Academy · Beginner · 🧠 Large Language Models · 10mo ago
GRPO is what DeepSeek used to train its amazing reasoning model. The biggest innovation is using reinforcement learning to get the model to improve itself, as opposed to self-supervised learning. Learn all about it in this friendly video!

Other videos in RL for LLMs:
- Deep Reinforcement Learning: https://www.youtube.com/watch?v=SgC6AZss478
- Reinforcement Learning with Human Feedback (RLHF): https://www.youtube.com/watch?v=Z_JUqJBpVOk
- Proximal Policy Optimization (PPO): https://www.youtube.com/watch?v=TjHH_--7l8g
- Direct Preference Optimization (DPO): https://www.youtube.com/watch?v=k2pD…
Watch on YouTube ↗

Chapters (9)

Introduction
0:26 Answering with context
1:40 DeepSeek vs ChatGPT
5:30 The GRPO score
7:05 Averaging over answers and steps
7:38 Quality (Advantage)
10:30 Probability of responses
15:36 Clipping the response
18:21 Not changing the model too much
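The chapter list above walks through GRPO's core mechanics: scoring a group of sampled answers, turning those scores into a relative quality signal (the advantage), and clipping the update so the model doesn't change too much. A minimal sketch of the group-relative advantage and the PPO-style clipped surrogate it plugs into (the reward values, function names, and group size below are illustrative assumptions, not taken from the video):

```python
def group_relative_advantages(rewards):
    """GRPO's key idea: sample several answers per prompt and score each
    answer relative to its own group, A_i = (r_i - mean(r)) / std(r),
    instead of training a separate value network."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard: identical rewards give std = 0
    return [(r - mean) / std for r in rewards]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective reused by GRPO: `ratio` is the new
    policy's probability for the answer divided by the old policy's;
    clipping to [1 - eps, 1 + eps] limits how far one step can move."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical example: four sampled answers for one prompt, scored 0..1.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

By construction the advantages are centered on zero, so answers better than their group's average push their probabilities up and worse ones push them down; the video's final chapter (a KL penalty against a reference model) additionally keeps the updated model close to where it started.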