From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face
Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk focuses on the latest advances in multi-turn and multi-environment GRPO training, enabling LLMs to learn from interactive, agent-like experiences, including interacting with simulated environments, using tools, or completing multi-step reasoning tasks.
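To make "learning from trajectories" concrete, here is a minimal sketch of a multi-turn rollout loop. Everything in it is hypothetical and for illustration only: `GuessNumberEnv`, `BinarySearchPolicy`, `Turn`, and `rollout` are invented names, the scripted policy stands in for an LLM, and none of this is the TRL or OpenEnv API. The point is that a single sample is now a sequence of (observation, action, reward) turns rather than one prompt–response pair.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    observation: str  # what the policy saw before acting
    action: str       # what the policy produced
    reward: float     # reward the environment returned for this action


class GuessNumberEnv:
    """Hypothetical toy environment: guess a hidden integer."""

    def __init__(self, target, lo=0, hi=15):
        self.target, self.lo, self.hi = target, lo, hi

    def reset(self):
        return f"Guess a number between {self.lo} and {self.hi}."

    def step(self, action):
        guess = int(action)
        if guess == self.target:
            return "correct", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return hint, 0.0, False


class BinarySearchPolicy:
    """Scripted stand-in for an LLM policy (assumption, not a real model)."""

    def __init__(self, lo=0, hi=15):
        self.lo, self.hi, self.last = lo, hi, None

    def act(self, observation):
        # Use the environment's feedback on the previous guess to narrow bounds.
        if observation == "higher":
            self.lo = self.last + 1
        elif observation == "lower":
            self.hi = self.last - 1
        self.last = (self.lo + self.hi) // 2
        return str(self.last)


def rollout(env, policy, max_turns=8):
    """Run one episode and return the full trajectory as a list of turns."""
    trajectory = []
    observation = env.reset()
    for _ in range(max_turns):
        action = policy.act(observation)
        next_observation, reward, done = env.step(action)
        trajectory.append(Turn(observation, action, reward))
        observation = next_observation
        if done:
            break
    return trajectory


trajectory = rollout(GuessNumberEnv(target=11), BinarySearchPolicy())
total_reward = sum(turn.reward for turn in trajectory)
```

In a real training setup the policy call would be a model generation and the environment would be a coding sandbox, terminal, or browser; the loop structure is the part this sketch is meant to convey.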
We highlight how TRL, as a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can draw on simulated environments (e.g., coding sandboxes, terminals, browsers), such as those available through OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will gain insights into design patterns, rollout handling, trajectory batching, and advantage computation, and will see how robust multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
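As a rough illustration of the advantage computation mentioned above, GRPO scores each sampled completion relative to the other completions drawn for the same prompt: the group mean reward is subtracted and the result is divided by the group standard deviation. The sketch below assumes one scalar reward per complete trajectory, uses the population standard deviation, and adds a small `eps` for stability. The function name `group_relative_advantages` and these choices are illustrative assumptions, not TRL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of trajectories for the same prompt.

    Assumption: `rewards` holds one scalar per trajectory; `eps` avoids
    division by zero when every trajectory in the group scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for the same task: two succeeded, two failed.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Successful trajectories receive positive advantages and failed ones negative advantages. When every trajectory in a group earns the same reward, all advantages are zero and that group contributes no learning signal.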