From Responses To Trajectories: Multi-Turn and Multi-Environ... Kashif Rasul & Sergio Paniego Blanco

Uploaded: 2026-04-20T20:21:45Z
Channel: PyTorch

PyTorch · Advanced · 🤖 AI Agents & Automation · 3w ago

Skills: LLM Engineering 90%, Agent Foundations 60%

From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face

Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk covers recent advances in multi-turn and multi-environment GRPO training, which lets LLMs learn from interactive, agent-like experience: interacting with simulated environments, using tools, or completing multi-step reasoning tasks. We highlight how TRL, a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can draw on simulated environments (e.g., coding, terminals, browsers) such as OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will learn about design patterns, rollout handling, trajectory batching, and advantage computation, and see how robust multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
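The abstract names advantage computation over trajectories without showing it. As a rough illustration only, the sketch below shows the group-relative idea behind GRPO in plain Python: several rollouts of the same prompt are scored, each reward is normalized against its own group, and the resulting trajectory-level advantage is applied only to tokens the policy generated. The function names, the mask layout, the `eps` value, and the use of population standard deviation are assumptions made for this example; they are not taken from the talk or from TRL's API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each trajectory's reward against its own rollout group.

    `rewards` holds one scalar per trajectory sampled from the same prompt.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(advantage, completion_mask):
    """Broadcast one trajectory-level advantage over policy tokens only.

    `completion_mask` is hypothetical: 1 marks a token sampled from the
    policy, 0 marks tool or environment output inserted between turns.
    """
    return [advantage * m for m in completion_mask]


# Hypothetical group: four rollouts of one prompt, scored by an environment.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# One multi-turn trajectory: two policy tokens, two tool tokens, one policy token.
mask = [1, 1, 0, 0, 1]
per_token = token_advantages(advs[0], mask)
```

Masking the environment tokens reflects a common design choice in multi-turn training: the policy should only be credited or penalized for text it produced, not for tool output that was appended to its context.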
