GRPO - Group Relative Policy Optimization - How DeepSeek trains reasoning models
GRPO is the algorithm DeepSeek used to train its reasoning model. The biggest innovation is using reinforcement learning to get the model to improve itself, as opposed to self-supervised learning. Learn all about it in this friendly video!
Other videos in RL for LLMs:
Deep Reinforcement Learning: https://www.youtube.com/watch?v=SgC6AZss478
Reinforcement Learning with Human Feedback (RLHF): https://www.youtube.com/watch?v=Z_JUqJBpVOk
Proximal Policy Optimization (PPO): https://www.youtube.com/watch?v=TjHH_--7l8g
Direct Preference Optimization (DPO): https://www.youtube.com/watch?v=k2pD…
Chapters (9)
Introduction (0:26)
Answering with context (1:40)
DeepSeek vs ChatGPT (5:30)
The GRPO score (7:05)
Averaging over answers and steps (7:38)
Quality (Advantage) (10:30)
Probability of responses (15:36)
Clipping the response (18:21)
Not changing the model too much
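The chapters above walk through the pieces of the GRPO score: the advantage of each answer relative to its group, the probability ratio of responses, and clipping so the model doesn't change too much. Here is a minimal sketch of those two pieces in plain Python. The function names and the reward values are hypothetical, chosen just for illustration; a real implementation would work on per-token log-probabilities from the model.

```python
import statistics

def group_advantages(rewards):
    # GRPO's key idea: sample a group of answers to the same question and
    # score each answer relative to the group, with no separate value network.
    # Advantage = (reward - group mean) / group standard deviation.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    # PPO-style clipping, reused by GRPO: keep the probability ratio
    # (new policy / old policy) inside [1 - eps, 1 + eps], then take the
    # more pessimistic of the clipped and unclipped terms, so a single
    # update cannot change the model too much.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Hypothetical group of 4 answers where two earned reward 1 and two earned 0:
advs = group_advantages([1.0, 0.0, 0.0, 1.0])  # → [1.0, -1.0, -1.0, 1.0]
```

Note how `clipped_objective(1.5, 1.0)` returns 1.2 (the gain from a good answer is capped), while `clipped_objective(1.5, -1.0)` returns -1.5 (the penalty for a bad answer is not softened); this asymmetry is what keeps updates conservative.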
DeepCamp AI