Reinforcement Learning for Multi-Turn Software Engineering Agents
This research explores training large language models (LLMs) as software engineering (SWE) agents with reinforcement learning (RL), moving beyond single-turn problems to complex, multi-turn interactions. The authors introduce a modified version of the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm to improve an agent's ability to solve real-world SWE tasks. Their approach, a two-phase training pipeline (rejection fine-tuning followed by multi-turn RL), significantly improves the agent's success rate on benchmarks such as SWE-bench Verified. The study highlights the challenges of long-h…
DeepCamp AI