How to finetune LLMs to THINK with Reinforcement Learning (GRPO from scratch!)
In this hands-on tutorial video, I explain Reasoning LLMs and SLMs and write the Group Relative Policy Optimization (GRPO) algorithm from scratch in PyTorch. This tutorial is aimed especially at Small Language Models (SLMs), but the same principles apply to Large Language Models (LLMs) too. Along the way, we go through the policy gradient equation, explain RLVR (Reinforcement Learning with Verifiable Rewards), and visualize exactly how reasoning models work!
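To give a taste of what the video builds up to: GRPO's core trick is that advantages are computed *relative to a group* of sampled responses for the same prompt (normalized rewards), and the policy is then updated with a PPO-style clipped objective. Below is a minimal, stdlib-only sketch of those two pieces; the function names and the `eps`/`clip_eps` defaults are illustrative assumptions, not the video's exact code.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage: normalize each reward against its group's
    mean and standard deviation (illustrative sketch)."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]

def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one token/response:
    take the pessimistic (minimum) of the unclipped and clipped terms."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)

# Example: two correct (reward 1) and two incorrect (reward 0) responses
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In a verifiable-rewards (RLVR) setup the rewards above would come from a checker (e.g. "did the answer match?"), which is why no learned reward model is needed.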
All materials for this video (as well as for all other videos on the channel) have been shared on my Patreon page.
https…
Watch on YouTube
Chapters (10)
Thinking LLMs are taking over! (3:47)
Setting up Reinforcement Learning Environment (4:50)
Reasoning Gym library - Rewards (8:00)
GRPO Visually explained (10:41)
Policy Optimization and PPO loss Explained (15:45)
Coding response generation (20:55)
Coding Reward Generation & Advantages (26:25)
Calculating log probabilities (30:58)
RL Training loop (33:49)
Visualizing
DeepCamp AI