How to finetune LLMs to THINK with Reinforcement Learning (GRPO from scratch!)

Neural Breakdown with AVB · Advanced · 🧠 Large Language Models · 9mo ago
In this hands-on tutorial video, I explain reasoning LLMs and SLMs and write the Group Relative Policy Optimization (GRPO) algorithm from scratch in PyTorch. The tutorial is aimed specifically at Small Language Models (SLMs), but the same principles apply to Large Language Models (LLMs) as well. We also go through the policy gradient equation, explain RLVR (reinforcement learning with verifiable rewards), and visualize exactly how reasoning models work! All materials for this video (as well as for all other videos on the channel) are shared on my Patreon page. https…
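The core idea GRPO adds on top of the policy gradient is the group-relative advantage: sample a group of responses per prompt, score each with a verifiable reward, and normalize the rewards within the group so responses are compared against their siblings rather than a learned value function. A minimal sketch of that step (function name and reward values are illustrative, not taken from the video):

```python
def group_relative_advantages(rewards):
    """Normalize a group of per-response rewards to zero mean, unit std.

    In GRPO, each response's advantage is its reward relative to the
    other responses sampled for the same prompt.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:  # all responses scored the same; avoid division by zero
        std = 1e-8
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled responses, binary rewards from a verifiable checker (RLVR)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct responses get positive advantage, incorrect ones negative.
```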
Watch on YouTube ↗

Chapters (10)

Thinking LLMs are taking over!
3:47 Setting up Reinforcement Learning Environment
4:50 Reasoning Gym library - Rewards
8:00 GRPO Visually explained
10:41 Policy Optimization and PPO loss Explained
15:45 Coding response generation
20:55 Coding Reward Generation & Advantages
26:25 Calculating log probabilities
30:58 RL Training loop
33:49 Visualizing
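The chapters on policy optimization and the PPO loss cover the clipped surrogate objective that GRPO reuses: weight each advantage by the probability ratio between the new and old policy, and clip that ratio so a single update cannot move the policy too far. A per-token sketch under that assumption (plain Python for clarity; the video implements this with PyTorch tensors):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate loss for one token.

    ratio = pi_new(token) / pi_old(token), computed from log-probs.
    Taking the min of the clipped and unclipped terms (and negating,
    since we minimize) caps how much one update can change the policy.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped * advantage)

# With identical policies (ratio = 1) the loss is just -advantage;
# a large ratio gets clipped to 1 + eps before it scales the advantage.
```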