Optimizing Reinforcement Learning at Trillion-Parameter Scale - Songlin Jiang

PyTorch · Advanced ·📄 Research Papers Explained ·3w ago
Optimizing Reinforcement Learning at Trillion-Parameter Scale - Songlin Jiang, Aalto University & Mind Lab This talk will dive into how we implemented and optimized reinforcement learning on trillion-parameter Mixture-of-Experts reasoning models using veRL, Megatron-Bridge and vLLM. The session is useful to anyone building large-scale RL training systems. For the first part, I will walk through the system design required to make RL work at this scale using LoRA: how LoRA adapters are implemented for expert layers, how adapters are sharded and fused under tensor/pipeline/expert parallelism, and most importantly, how refit (parameter sync) is implemented for LoRA between training backend (Megatron) and rollout engine (vLLM). The second part of the talk focuses on training–inference mismatch in MoE RL. I will explain why common mitigations such as clipping and importance sampling can fail, and how we implement fixed Router Replay R3 across vLLM, veRL, and Megatron to align routing decisions between rollout and training. These works are done together with Mind Lab and some of the related blog posts are at: - https://macaron.im/mindlab/research/building-trillion-parameter-reasoning-rl-with-10-gpus - https://macaron.im/mindlab/research/router-replay-r3-why-it-failed-and-how-we-fixed-it
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Related AI Lessons

The ABCs of reading medical research and review papers these days
Learn to critically evaluate medical research papers by accepting nothing at face value, believing no one blindly, and checking everything
Medium · LLM
#1 DevLog Meta-research: I Got Tired of Tab Chaos While Reading Research Papers.
Learn to manage research paper tabs efficiently and apply meta-research techniques to improve productivity
Dev.to AI
How to Set Up a Karpathy-Style Wiki for Your Research Field
Learn to set up a Karpathy-style wiki for your research field to organize and share knowledge effectively
Medium · AI
The Non-Optimality of Scientific Knowledge: Path Dependence, Lock-In, and The Local Minimum Trap
Scientific knowledge may be stuck in a local minimum, hindering optimal progress, and understanding this concept is crucial for advancing research
ArXiv cs.AI
Up next
Microsoft Research Forum | Season 2, Episode 4
Microsoft Research
Watch →