NEW AI Coding Agents for SWE: ENTROPY

Discover AI · Advanced ·📄 Research Papers Explained ·9mo ago

Key Takeaways

The video discusses three new AI research papers on improving AI coding agents for software engineering. The main paper, EntroPO, adds entropy regularization to multi-turn preference optimization (DPO and KTO) and pairs it with test-time scaling. The other two papers cover inference-time process reward models and a benchmark for agent effectiveness under resource constraints.

Full Transcript

Hello community. So great that you are back. Today we talk about software engineering and the latest research to optimize your AI coding agents. Now, you know software engineering: great LLMs reasoning over larger and larger codebases, and now we have the complexity of multiple tool uses. So what about the reasoning process over code, over large codebases, now that we have tools like search, execution, patching, etc.? We are gaining complexity. So how can we improve our systems?

Now SWE-bench, you know: here you see the performance data for bash only, with a minimal agent configuration. Great. Now there is a challenge: to avoid mode collapse, where the models overfit to only very narrow solution paths. This is particularly the case if you have a configuration with reinforcement learning from human feedback or direct preference optimization, and then add test-time scaling. Absolutely. So how do we do this? We do not scale the parameters. We do test-time scaling in this particular instance, for the codebase.

What is the new idea? What is the new innovation? What is currently the topic of research? How do we improve our AI coding agents? Simply entropy regularization. Now I can explain this to you within 5 seconds. Imagine we have a codebase that grows larger and larger and more complex. You need more reasoning. You have more tool use. You have more dependencies between the tools. You have more agents that communicate. So the complexity increases. At the same time you have your AI coding agent. Now imagine this spans a mathematical space. This is the solution space, and somewhere in this space is the perfect solution for your task.
So what is happening in general: an agent starts, but we are somehow limited to a specific segment of this mathematical solution space. Let's say this is the complete set of solutions that your agent will find for your particular topic. Now you immediately see this is not the complete solution space. No, this is focused on a very narrow subspace. So you ask: hey, what about those solutions? And what about those? And you see, the entropy is so, so important. We want a regularization of the entropy so that our coding agent explores different regions of the solution space and is not just focused on this one. Of course this refers immediately back to the theoretical problem in reinforcement learning: exploitation versus exploration.

So what is the solution? The solution is presented in a new research paper: EntroPO is here. You have guessed it. Entropy regularization is task agnostic. This is beautiful, because it is not just for the complex reasoning domain of code generation and software engineering; it would also work for mathematics or scientific discovery. But more about this in a later video. Let's stick with the publication, where we scale our agents with test-time compute and use entropy preservation to maintain exploration in complex sequential decision making over a codebase.

Here is the publication: September 15, 2025, Northwestern University, Capital One, and Meta (Facebook), "Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization". DPO, you know everything, beautiful. So let's have a look at how we can optimize the next generation of coding agents.
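To make the "narrow subspace" picture above concrete, here is a minimal sketch that compares the Shannon entropy of a collapsed policy with that of an exploring one. The two distributions and the name `shannon_entropy` are illustrative assumptions, not taken from the video or the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical policies over four candidate solution strategies.
collapsed = [0.97, 0.01, 0.01, 0.01]   # almost all mass on one strategy
exploring = [0.25, 0.25, 0.25, 0.25]   # mass spread over all strategies

print("collapsed policy entropy:", shannon_entropy(collapsed))
print("exploring policy entropy:", shannon_entropy(exploring))
```

The collapsed policy has much lower entropy than the uniform one, which attains the maximum of log(4) nats for four options. An entropy regularizer rewards the policy for staying closer to the second kind of distribution during training.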
Now, whether you go with DPO or with Kahneman-Tversky optimization (KTO), never mind: whatever those methodologies do, they often reduce the policy entropy. As I showed you, you are only focused on a particular sub-sector of the solution space. You don't want this. So we have to introduce a new term, and the authors did this with an entropy-regularized framework that extends those methods to multi-turn, tool-assisted Markov decision processes, preserving the diversity across all the different trajectories we are going to compute in parallel with test-time compute. Great. Now, if you're not familiar with Markov decision processes, here is a very short summary. Great.

Now what the authors found when they introduce this entropy regularization term: the entropy-augmented DPO loss. This is it. And if you go multi-turn, then you have the particular Q-values that are defined recursively. Is this immediately clear? No. In the paper they have some really beautiful cases where they show you exactly, for the single-turn case and for the multi-turn case, how they arrive at this formula. But even this is not enough, because if you look at the appendix, there you have the complete mathematical proof of Proposition 3.2. So here you go. And of course they also do this for 3.3, for the multi-turn case. So it is not as easy as it might seem. But I will ignore this for the moment. I will just give you the result, this new entropy-regularized DPO loss function. And here we have it. Yes, of course there is also an additional appendix on how exactly to derive the EntroPO loss function mathematically. But I guess we just take the result and run with it, and let's have a look if it's really great. If you do not want to restrict yourself to DPO, you can go with Kahneman and Tversky's prospect theory: KTO, model alignment as prospect theoretic optimization.
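The formula itself is not reproduced in the transcript. As a rough illustration of how an entropy term can enter a DPO-style loss, here is a minimal single-turn sketch. It assumes the objective E[r] - beta * KL(pi || pi_ref) + alpha * H(pi), whose optimal policy satisfies r = (alpha + beta) * log pi - beta * log pi_ref + const. Plugging that into the usual Bradley-Terry preference model gives the margin used below. The names `entropy_dpo_loss`, `alpha`, and `beta` are assumptions for this sketch. It is not the paper's Proposition 3.2 and does not cover the multi-turn Q-value recursion.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def entropy_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                     beta=0.1, alpha=0.0):
    """DPO-style loss for one (preferred, rejected) pair.

    logp_w / logp_l:         policy log-probs of preferred / rejected response
    ref_logp_w / ref_logp_l: reference-model log-probs of the same responses
    beta:  KL-regularization strength (standard DPO hyperparameter)
    alpha: entropy-bonus strength (assumed; alpha=0 recovers standard DPO)
    """
    margin = (alpha + beta) * (logp_w - logp_l) \
             - beta * (ref_logp_w - ref_logp_l)
    return -log_sigmoid(margin)

# Policy equal to the reference: standard DPO loss is log(2).
print(entropy_dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Same pair with an entropy coefficient switched on.
print(entropy_dpo_loss(-1.0, -2.0, -1.0, -2.0, alpha=0.05))
```

With `alpha=0` the margin reduces to the standard DPO margin, so the sketch can be checked against a known baseline before experimenting with a nonzero entropy coefficient.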
You see this here in reference to PPO-Clip and DPO. Great. So we are focused now on scaling. We are going with test-time compute at inference. So in TTS the agent generates multiple candidate trajectories. And by the way, yes, I know some of you might say, hey, TTS is text-to-speech. No: in the current notation of the universities, TTS here is test-time scaling. We go with 16 trajectories in parallel for any given problem instance, whatever you have: bug fixing, a code repo, whatever. And now we have the task: only 16 trajectories, but we have to select the optimal trajectory for our coding agent. How to do this?

The authors thought about it, and EntroPO employs a hybrid selector. This is interesting. You know, whenever you're not really sure how to go, you take a little bit of this one and a little bit of that one, and then you say, okay, let's build a hybrid solution. This is what happened here. It is great, and you guessed it: the first component is of course an AI model, and the second component is a simple rule-based one, where you say, okay, these are the heuristic rules that I want the system to apply. So nothing specific: we have a probabilistic scoring and we have deterministic rules for efficiency and interpretability, leading, and this is the big question, really to empirical improvements. Let's have a look at this. Yes, of course we go with Markov decision processes, and you know exactly why if you have seen my last videos.

Now this particular process for EntroPO is interesting. At first we have the trained verifier; if you want, this is the oracle element. It is a supervised learning model trained on a preference dataset which consists of labeled trajectory pairs. So we go with preferred versus non-preferred for the oracle feedback. Something you know, absolutely no problem, standard procedure.
This verifier assigns a standard probability score, in whatever interval you like, to each of our 16 trajectories, thereby estimating a likelihood of success. For the training of this verifier, yes, of course, for your domain-specific task you have to have a training dataset and you have to train it, but you can use binary cross-entropy loss here. This is simple. This is familiar. No problem at all to implement this.

The second part, the deterministic heuristic with the simple rules, is not as easy. There are no learned parameters; instead we apply a domain-specific catalog of criteria derived from software engineering principles, from what you want to achieve. In the simplest case they go with a binary indicator that says, hey, we only go with complete trajectories. Well, of course. Second, also a binary check that executes a repo-wide regression test on the proposed patch. Great. And then they have a filter set with the argmax function. Interestingly, they say we want to favor longer trajectories, under the hypothesis that they reflect more thorough exploration: multiple code inspections, test executions, or other iterative refinements before the patch submission itself. So okay, there they go with this preference for longer trajectories.

Let's have a look at the benchmark, at the evaluation, and here you see it: test-time scaling on SWE-bench Verified and SWE-bench Lite. Look, supervised fine-tuning is the baseline here; you see, okay, not really great. Then multi-turn KTO, in yellow: yes, it gets better with this multi-turn training. But it gets really nice if we now add this entropy-regularized term, so that we do not restrict ourselves to a segment of the solution space but try to explore the complete solution space. Look at how much more nicely our performance increases now, to almost 58 or 59%. So you see here a 3 to 5.6% performance gain with an entropy regularization term.
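The hybrid selector described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The `Trajectory` fields, the fallback when no candidate passes the rules, and using trajectory length only as a tie-breaker are choices made for this sketch, because the video does not say how the verifier score and the length preference are combined.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    verifier_score: float     # success probability from the trained verifier
    is_complete: bool         # rule 1: the agent finished and submitted a patch
    passes_regression: bool   # rule 2: repo-wide regression tests pass
    num_steps: int            # trajectory length (longer is favored)

def select_trajectory(candidates):
    """Pick one trajectory from a non-empty list of candidates.

    Step 1 (deterministic rules): keep complete trajectories whose patch
    passes the regression tests. If none qualify, fall back to all
    candidates (an assumption of this sketch).
    Step 2 (argmax): highest verifier score wins; ties go to the
    longer trajectory (also an assumption).
    """
    eligible = [t for t in candidates
                if t.is_complete and t.passes_regression]
    pool = eligible or candidates
    return max(pool, key=lambda t: (t.verifier_score, t.num_steps))

# Hypothetical candidates; the video uses 16 parallel trajectories.
candidates = [
    Trajectory(0.90, False, True, 30),   # high score but incomplete
    Trajectory(0.70, True, True, 12),
    Trajectory(0.70, True, True, 20),    # same score, longer trajectory
    Trajectory(0.80, True, False, 40),   # fails regression tests
]
best = select_trajectory(candidates)
print(best)
```

Filtering before scoring keeps the deterministic rules interpretable: a trajectory that is incomplete or breaks the test suite cannot win, no matter how confident the verifier is.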
Did we expect it? Yes, of course: we are looking for more novel ideas, and we do not stick to the old solution. We explore the solution space. So, in short, EntroPO outperforms DPO by 5 to 10% absolute, especially with test-time compute scaling, due to higher entropy. The entropy term is crucial. Without it we would not achieve these beautiful results. If you want the numerical table with all the benchmark data, this is for you. You see our two EntroPO entries. Great.

So what is the summary? Easy. The paper highlights that entropy regularization in multi-turn preference optimization is key for our agentic systems. Today in this video we looked at agentic systems for coding with, if you want, theoretical guarantees, enabling scalable test-time compute and diverse exploration in software engineering. Great. And you know the beauty of this idea: it is also applicable to mathematics, to theoretical physics, or to any other science experiment. It is nice to explore this idea not just for coding but also for reasoning complexities in general, or for visual complexities, for example.

But of course it's not the only study I wanted to show you in this video today. Look here: September 2, 2025, "When Agents go Astray: Course-Correcting SWE Agents with PRMs", also with inference-time process reward models. This is by Carnegie Mellon University and IBM Research. A very nice study, have a look at it. This is another idea, a very similar idea.
You know, inference time: you say yes, test-time compute, let's go with this. And in these precious 10, 20, 30 seconds that we have, you now have a process reward model that you build up to detect and course-correct trajectory-level errors that happen in your agentic system. So depending on what particular task you have to optimize your test-time compute scaling for, this might be another methodology to go with.

And of course, I have another study. This one is from September 11, 2025. They say, you know what, it is nice if you just go for an evaluation of the output. But what about the effectiveness? Is it really the shortest, best solution, or has the system been wandering around for 10 seconds, just looking in one direction and ignoring all the rest? So what about the effectiveness of our SWE agents under resource constraints, if you have smaller models, if you do not have 10 or 20 seconds, if you want to do it in 2 seconds? What is the optimization step if you go for effectiveness? Here you have Huawei, the Chinese University of Hong Kong, King's College London, and Queen's University, and they go with a new SWE benchmark using a multi-dimensional metric that tries to incorporate the effectiveness, which they measure in a very interesting way. Have a look also at this third study. They say, yeah, we want to optimize our test-time compute scaling performance.

So you see where we are currently: everybody is looking at SWE, to improve it, to make it faster, and to add entropy regularization terms. And yeah, it looks good: at least 5%, maybe up to 10% performance jump for your next AI coding agent systems. I hope you enjoyed it. See you in my next video.

Original Description

Three new AI research papers to further improve the performance of our AI agents for CODING, Software Engineering. New SWE benchmarks available on the new architecture, based on new entropy regularization.

All rights w/ authors:

"BUILDING CODING AGENTS VIA ENTROPY-ENHANCED MULTI-TURN PREFERENCE OPTIMIZATION"
Jiahao Yu (Northwestern University), Zelei Cheng (Capital One), Xian Wu (Meta), Xinyu Xing (Northwestern University)
arXiv:2509.12434

"When Agents go Astray: Course-Correcting SWE Agents with PRMs"
Shubham Gandhi (Carnegie Mellon University), Jason Tsay (IBM Research), Jatin Ganhotra (IBM Research), Kiran Kate (IBM Research), Yara Rizk (IBM Research)
arXiv:2509.02360

"SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints"
Zhiyu Fan 1, Kirill Vasilevski 2, Dayi Lin 2, Boyuan Chen 2, Yihao Chen 2, Zhiqing Zhong 3, Jie M. Zhang 4, Pinjia He 3, Ahmed E. Hassan 5
1 Huawei, 2 Huawei Canada, 3 The Chinese University of Hong Kong, Shenzhen, 4 King's College London, 5 Queen's University
arXiv:2509.09853

#ai #aicoding #coding #softwareengineering

The three papers cover an entropy-regularized multi-turn preference optimization method (EntroPO), inference-time process reward models for course-correcting SWE agents, and a benchmark for agent effectiveness under resource constraints (SWE-Effi). According to the video, EntroPO outperforms DPO by 5 to 10% absolute, especially with test-time scaling. The video is relevant for researchers and practitioners interested in AI coding agents and software engineering.

Key Takeaways

Train a verifier model on a preference dataset of labeled trajectory pairs
Assign a probability score to each candidate trajectory
Use a hybrid selector (trained verifier plus heuristic rules) to select the optimal trajectory
Employ an entropy regularization framework to counter policy entropy reduction
Evaluate performance on SWE-bench Verified and SWE-bench Lite
Apply multi-turn preference optimization to AI coding agents
Consider inference-time process reward models to course-correct agent trajectories

💡 According to the video, entropy regularization improves AI coding agents for software engineering by 5-10% absolute over DPO, and because it is task agnostic it may also apply to other reasoning domains such as mathematics or scientific discovery.
