Reinforcement Learning for Multi-Turn Software Engineering Agents

PaperVideos · Advanced · 🧠 Large Language Models · 11mo ago

Key Takeaways

This video explains how reinforcement learning can be used to train large language models as software engineering agents that solve complex, multi-turn problems.

Full Transcript

What if an AI could teach itself to code? I'm not just talking about spitting out code snippets. I mean really learn from its own mistakes through trial and error, just like a human developer does. Well, some incredible new research is making that a reality. And we're going to break it all down.

All right, here's the plan. First, we'll tackle the big question. Why is coding so hard for AI in the first place? Then, we'll get into the secret sauce, learning from trial and error. We'll walk through the two-part training plan they used, check out the incredible test results, and then we'll wrap up by looking at what's next for AI coders.

Okay, so first things first, let's really set the stage here. The big problem everyone's trying to crack is how to get an AI agent to be a genuinely useful software engineer in the real world. And you know what? It's way, way harder than you might think. And that really boils down to this one question. Can we get an AI to act like an actual human developer? We're not just talking about autocompleting a line of code here. We're talking about an AI that can dive into a messy codebase, figure out what a bug report is really saying, and then step by step work the problem just like a single dev would.

Okay? To really get why this is so tough, you have to understand this key difference. Think about it. A lot of AI tasks are what you'd call single-turn. It's like a math problem. One question, one answer, boom, done. But fixing a software bug, no way. That's a multi-turn task. It's more like a back-and-forth conversation. The AI has to write some code, run a test, see it fail, figure out why it failed, and then decide what to try next. Every single move depends on the one before it.

So, the old ways of trying to build AI coders, they kind of hit a wall. They usually lean on these gigantic proprietary models, you know, the ones you can't even look inside. Plus, they need tons and tons of perfect human-written examples to learn from, these so-called teacher models. And you guessed it, all of that is super resource-intensive and expensive. It was pretty clear we needed a totally new game plan, something more efficient, something anyone could use, and something that could actually learn for itself.

And that's where the big idea comes in. The game changer, reinforcement learning. See, instead of just spoon-feeding the AI perfect answers, what if you let it learn by actually getting its hands dirty by doing the work?

So, what exactly is reinforcement learning, or RL? The easiest way to think about it is like training a dog. The AI agent, our dog, tries something, an action. In this case, it writes a piece of code. If that action works out well, say a test passes, it gets a treat, a reward. If it doesn't, it gets a penalty. And over and over again, it starts to figure out which actions lead to those rewards. It is quite literally learning from experience.

And here's what that learning loop looks like in practice. Step one, the AI agent takes an action, like editing a code file. Step two, the environment, the codebase, talks back. It might say, "Nope, compiler error." Or, "Hey, that test failed." Step three, based on that feedback, the agent learns. It gets a penalty for the failure. And finally, step four, it adjusts its game plan for the next try. And this loop just keeps going, getting a little bit smarter with every single pass.

Now, for you tech heads out there, they didn't just grab any old RL algorithm off the shelf. Nope. They used a specially modified version of something called DAPO. That's decoupled advantage policy optimization. All you really need to know is that this version is custom-built for these long, drawn-out problems where you might not know if you succeeded until the very, very end.

All right, so we've got the what: reinforcement learning. Now let's get into the how. How did they actually train this thing? Turns out it was a really clever two-step process.

So phase one is all about building a solid foundation. They call it rejection fine-tuning, or RFT. Basically, they took a base AI model and just let it try to solve thousands of problems. Then they threw out all the failed attempts and kept only the ones that worked. They used this collection of successes to give the model a good baseline for what a correct solution looks like.

But then comes phase two, the main event. They switch on the reinforcement learning and let the model actively practice, learn, and improve from its own successes and failures over and over for more than 100 training iterations.

And you can see right away how much that first phase helped. The original base model, it was only successful about 11% of the time. But after just that first round of rejection fine-tuning, boom, its performance nearly doubled to 20%. That gave the AI a much, much better launchpad before it even started the real trial-and-error learning.

Now, that second RL phase had a few key tweaks that made all the difference. To deal with big, messy codebases, they doubled the model's memory, called its context window, all the way up to over 130,000 tokens. They also got really smart about the training data, focusing on problems that were in that Goldilocks zone. Not too easy, not too hard, but just right for learning. And the reward, it was super simple, almost brutal. You get a one if you fix the bug, a zero if you don't. That's it. No partial credit.

Okay, so after all that clever training, the big question is, did it work? How well did this thing actually perform? Let's dive into the results. And here it is, the big number, 39%. The final agent nailed a 39% success rate on a tough industry benchmark called SWE-bench. Now, think about that for a second. It started at 11%. It jumped to 20% after phase one, and now it's almost doubled its performance again. That is a huge, huge leap, and it's all thanks to the AI learning on its own.

And the researchers themselves really hit on why this is such a big deal. They said, and I'm quoting here, that they did this without relying on any teacher models. That's the magic phrase. The agent didn't need a perfect answer key to study from. It figured out how to get this good all by itself, which is a massive step towards building AIs that can actually operate independently.

And just to put that 39% into context, check out this chart. Here's our RL agent compared to some other top open-source models. It's performing right up there with the very best, like DeepSeek, and it's blowing some other big names out of the water. What this really proves is that the training method itself is incredibly powerful. It can take a good, solid open model and turn it into an absolute top-tier performer.

So this is obviously a huge breakthrough, but you know the work is never really done, right? To wrap things up, let's look ahead at the next big hurdles we need to clear to get to those truly autonomous AI developers. And the paper points to three major challenges that are still out there. First is something called sparse rewards. Remember how the AI only gets a reward at the very end when the bug is fixed? Well, if it took 12 different steps to get there, how does it know which one was the genius move? That's the problem. And that's closely related to credit assignment. It's tough to pinpoint the exact action that led to success. And finally, there's uncertainty. Right now, this AI will try to tackle any problem you give it with full confidence. But what we really need is an AI that knows its own limits. One that can recognize when it's out of its depth and say, "Hey, I need some help here."

And at the end of the day, solving these challenges isn't just about making the AI a better coder. It's about something much more fundamental. It's about building trust. That's the real key to creating AI agents that we can feel confident handing over complex, important, real-world jobs to.

Look, what this research shows is that we are moving incredibly fast. Yes, there are still big challenges ahead. But the idea of an AI agent being a skilled, independent software engineer, well, that's starting to feel less like sci-fi and more like an actual job description. Which leaves us all with a pretty wild question to chew on. How long is it really until an AI is your new senior developer?
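
To make the multi-turn loop from the transcript concrete (take an action, get feedback from the environment, repeat, and only learn at the very end whether the bug was fixed), here is a minimal toy sketch in Python. It is not the paper's code. `ToyBugEnv`, `run_episode`, and the integer "edits" are hypothetical stand-ins invented for illustration, and the sketch covers only the rollout and the one-or-zero end-of-episode reward, not the policy update.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyBugEnv:
    """Hypothetical stand-in for a codebase: the bug counts as fixed once value == target."""
    target: int = 3
    value: int = 0
    max_turns: int = 10
    turns: int = 0

    def step(self, action: int) -> Tuple[str, bool]:
        """Apply one 'edit' (an integer delta) and return (feedback, done)."""
        self.value += action
        self.turns += 1
        if self.value == self.target:
            return "tests passed", True
        if self.turns >= self.max_turns:
            return "out of turns", True
        return "test failed", False


def run_episode(
    env: ToyBugEnv, policy: Callable[[str], int]
) -> Tuple[List[Tuple[int, str]], int]:
    """Roll out one multi-turn episode; the reward is only known at the end."""
    trajectory: List[Tuple[int, str]] = []
    feedback, done = "start", False
    while not done:
        action = policy(feedback)            # step one: the agent acts
        feedback, done = env.step(action)    # step two: the environment talks back
        trajectory.append((action, feedback))
    # Sparse, binary reward: one if the bug is fixed, zero otherwise.
    reward = 1 if env.value == env.target else 0
    return trajectory, reward


if __name__ == "__main__":
    steps, reward = run_episode(ToyBugEnv(), lambda feedback: 1)
    print(len(steps), reward)
```

Because the single reward arrives only after the last turn, every action in the trajectory shares it, which is the credit assignment difficulty the transcript describes.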

Original Description

This research explores training large language models (LLMs) as software engineering (SWE) agents using reinforcement learning (RL), moving beyond single-turn problems to complex, multi-turn interactions. The authors introduce a modified Decoupled Advantage Policy Optimization (DAPO) algorithm to enhance an agent's ability to solve real-world SWE tasks. Their approach, which includes a two-phase training pipeline (rejection fine-tuning followed by multi-turn RL), significantly improves the agent's success rate on benchmarks like SWE-bench Verified. The study highlights the challenges of long-horizon interactions and sparse rewards in SWE, while demonstrating RL's potential for building more capable autonomous agents from open-weight models. The work also details algorithmic modifications, hyperparameters, and infrastructure used to achieve these advancements.
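
The data-selection ideas behind the two-phase pipeline can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation. The attempt records, the function names, and the success-rate thresholds are assumptions made for the example. The source says only that failed attempts were discarded for rejection fine-tuning, that RL training focused on problems that were neither too easy nor too hard, and that the reward was one for a fix and zero otherwise.

```python
from collections import defaultdict
from typing import Dict, List


def binary_reward(tests_passed: bool) -> int:
    """One if the bug is fixed, zero if not. No partial credit."""
    return 1 if tests_passed else 0


def rejection_filter(attempts: List[Dict]) -> List[Dict]:
    """Phase one (rejection fine-tuning): keep only the successful attempts."""
    return [a for a in attempts if binary_reward(a["tests_passed"]) == 1]


def goldilocks_tasks(
    attempts: List[Dict], low: float = 0.0, high: float = 1.0
) -> List[str]:
    """Keep tasks whose success rate lies strictly between low and high.

    The default thresholds are an assumption for this sketch: they drop tasks
    the model always solves (too easy) or never solves (too hard).
    """
    outcomes: Dict[str, List[int]] = defaultdict(list)
    for a in attempts:
        outcomes[a["task"]].append(binary_reward(a["tests_passed"]))
    kept = []
    for task, rewards in outcomes.items():
        rate = sum(rewards) / len(rewards)
        if low < rate < high:
            kept.append(task)
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical attempt log: two tries on each of three tasks.
    attempts = [
        {"task": "a", "tests_passed": True},
        {"task": "a", "tests_passed": True},
        {"task": "b", "tests_passed": True},
        {"task": "b", "tests_passed": False},
        {"task": "c", "tests_passed": False},
        {"task": "c", "tests_passed": False},
    ]
    print(len(rejection_filter(attempts)))
    print(goldilocks_tasks(attempts))
```

In this toy log, task "a" is always solved and task "c" is never solved, so only task "b" gives a mix of successes and failures to learn from.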

