Training an LLM to play chess using DeepSeek GRPO reinforcement learning

Efficient NLP · Beginner · 🧠 Large Language Models · 1y ago
Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

In this video, we see that although popular LLMs like GPT-4o, o1 Reasoning, and DeepSeek R1 show some understanding of chess, they often fail to play legal moves. To address this, we train our own reasoning-focused chess LLM using the Group Relative Policy Optimization (GRPO) method introduced in DeepSeek R1. We walk through how GRPO differs from traditional PPO (Proximal Policy Optimization) and fine-tune LLaMA 8B and Qwen 7B using the TRL (Transformers Reinforcement Learning) and Unsloth libraries - the results a…
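The GRPO setup described here hinges on a reward function that scores each sampled completion, which TRL's `GRPOTrainer` then uses to compute group-relative advantages. As a minimal sketch (the video's actual reward is not shown here), a format-based reward could check whether the model's answer looks like a chess move in SAN notation; the regex and the model/dataset names in the commented wiring are illustrative assumptions, not the video's exact code:

```python
# Hypothetical reward function for GRPO chess training (a sketch, not the
# video's actual implementation). Scores a completion 1.0 if its final
# token looks like a SAN-format chess move, else 0.0.
import re

# Rough SAN pattern: castling, or optional piece letter, optional
# disambiguation, optional capture, destination square, optional
# promotion and check/mate marker.
SAN_MOVE = re.compile(
    r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$"
)

def format_reward(completions, **kwargs):
    """Return one reward per completion, the shape TRL's GRPOTrainer expects."""
    rewards = []
    for text in completions:
        words = text.strip().split()
        move = words[-1] if words else ""
        rewards.append(1.0 if SAN_MOVE.match(move) else 0.0)
    return rewards

# Wiring into TRL (requires trl installed and a GPU; model and dataset
# names below are placeholders, not confirmed from the video):
# from trl import GRPOTrainer, GRPOConfig
# trainer = GRPOTrainer(
#     model="Qwen/Qwen2.5-7B-Instruct",
#     reward_funcs=format_reward,
#     args=GRPOConfig(output_dir="chess-grpo"),
#     train_dataset=dataset,  # prompts describing board positions
# )
# trainer.train()
```

A format-only reward like this rewards legal-looking notation, not legal moves; a stronger reward would validate the move against the actual board position.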
Watch on YouTube ↗

Chapters (14)

0:00 Introduction
1:18 Chess RL Strategy
3:51 How well do the best LLMs understand chess?
6:41 Picking a base model
8:31 Unsloth and TRL libraries for RL with LLMs
9:38 LoRA (Low Rank Adaptation)
10:55 GSM8K reasoning example
12:06 PPO (Proximal Policy Optimization)
14:12 GRPO (Group Relative Policy Optimization)
17:15 GRPO training results
18:11 Analysis of results for LLaMA and Qwen
20:52 Limitations of GRPO on small models
23:29 Grandmaster-level chess without search
27:10 ChessGPT and other LLMs that play chess