NEW AI Coding Agents for SWE: ENTROPY
Skills:
Research Methods 90% · Reading ML Papers 85% · RAG Basics 80% · Tool Use & Function Calling 70% · Multi-Agent Systems 60%
Key Takeaways
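The video discusses three new AI research papers on improving AI coding agents for software engineering. The main paper, EntroPO, adds entropy regularization to multi-turn preference optimization (DPO and KTO) and combines it with test-time scaling and a hybrid trajectory selector. The other two papers cover inference-time process reward models that course-correct SWE agents, and a benchmark that measures agent effectiveness under resource constraints.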
Full Transcript
Hello community. So great that you are back. Today we talk about software engineering and the latest research to optimize your AI coding agents. In software engineering, LLMs now reason over larger and larger codebases, and we have the added complexity of multiple tool use. So what about the reasoning process over large codebases when the agent also has tools like search, execution, patching, and so on? We are gaining complexity, so how can we improve our systems? On SWE-bench, you see here the performance data for bash only, with a minimal agent configuration. Great.
Now there is a challenge: avoiding mode collapse, where the models overfit to only very narrow solution paths. This is a particular risk in a configuration with reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), with test-time scaling added on top. So how do we do this? We do not scale the parameters. In this particular instance we do test-time scaling for the codebase. What is the new idea? What is the new innovation? What is currently the topic of research for improving our AI coding agents? Simply entropy regularization.
Now I can explain this to you within 5 seconds. Imagine we have a codebase that grows larger and larger and more complex. You need more reasoning. You have more tool use. You have more dependencies between the tools. You have more agents that communicate. So the complexity increases. At the same time you have your AI coding agent. Imagine that all possible solutions span a mathematical space. This is the solution space, and somewhere in this space is the perfect solution for your task.
So what is happening in general? An agent starts, but it is somehow limited to a specific segment of this mathematical solution space. Let's say this segment is the complete set of solutions that your agent will find for your particular topic. You immediately see that this is not the complete solution space. It is just a focus on a very narrow subspace. So you ask: hey, what about those solutions? And what about those solutions? And you see, the entropy is so important. We want a regularization of the entropy so that our coding agent explores different regions of the solution space and is not just focused on this one. Of course this refers immediately back to the theoretical problem in reinforcement learning: exploitation versus exploration.
So what is the solution? The solution is presented in a new research paper: EntroPO. You have guessed it. Entropy regularization is task agnostic. This is beautiful, because it is not just for the complex reasoning domain of code generation and software engineering. It would also work for mathematics or scientific discovery. But more about this in a later video.
Let's stick with the publication, where we scale test-time compute for our agents and preserve entropy to maintain exploration in complex sequential decision making over a codebase. Here is the publication, September 15, 2025, from Northwestern University, Capital One, and Meta (Facebook): "Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization". DPO, you know everything, beautiful. So let's have a look at how we can optimize the next generation of coding agents.
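The idea of a narrow versus a broad exploration of the solution space can be made concrete with Shannon entropy. The following is a minimal sketch, not taken from the paper: the two probability distributions over four candidate solution strategies, the helper names, and the weight `alpha` are all illustrative assumptions.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(expected_reward, probs, alpha):
    """Expected reward plus an entropy bonus weighted by alpha (assumed form)."""
    return expected_reward + alpha * shannon_entropy(probs)


# Hypothetical policies over four candidate solution strategies.
collapsed = [0.97, 0.01, 0.01, 0.01]  # almost always picks strategy 0
diverse = [0.25, 0.25, 0.25, 0.25]    # explores all four strategies equally

h_collapsed = shannon_entropy(collapsed)
h_diverse = shannon_entropy(diverse)

print(f"entropy (collapsed policy): {h_collapsed:.3f} nats")
print(f"entropy (diverse policy):   {h_diverse:.3f} nats")

# With equal expected reward, the entropy bonus favors the diverse policy.
print(regularized_objective(1.0, collapsed, alpha=0.1))
print(regularized_objective(1.0, diverse, alpha=0.1))
```

The uniform policy has the maximum entropy for four options, log(4), while the collapsed policy sits close to zero. An entropy bonus in the training objective is one common way to push a policy away from the collapsed case.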
Now, whether you go with DPO or with Kahneman-Tversky optimization (KTO), never mind: whatever those methodologies do, they often reduce the policy entropy. As I showed you, you are then only focused on a particular sector, a subsector of the solution space. You don't want this. So we have to introduce a new term, and the authors did this with an entropy-regularized framework that extends those methods to multi-turn conversations and tool-assisted Markov decision processes, preserving the diversity across all the different trajectories we are going to compute in parallel at test time. Great. If you are not familiar with Markov decision processes, here is a very short summary.
What the authors found when they introduce this entropy regularization term is the entropy-augmented DPO loss. This is it. And if you go multi-turn, you have particular Q-values that are defined recursively. Is this immediately clear? No. In the paper they have some really beautiful cases where they show you exactly, for the single-turn case and for the multi-turn case, how they arrive at this formula. But even this is not enough, because in the appendix you have the complete mathematical proof of Proposition 3.2. And of course they also do this for Proposition 3.3, for the multi-turn case. So it is not as easy as it might seem.
But I will ignore this for the moment. I will just give you the result: this new entropy-regularized DPO loss function. And here we have it. Yes, of course, there is also an additional appendix on how exactly to derive the EntroPO loss function mathematically. But I guess we just take the result and run with it, and let's have a look at whether it is really great. If you do not want to restrict yourself to DPO, you can go with Kahneman-Tversky prospect theory: KTO, model alignment as prospect-theoretic optimization.
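The transcript only names the entropy-augmented DPO loss without stating it. As a hedged illustration, here is one single-turn form that follows from adding an entropy bonus `alpha * H(pi)` to the usual KL-regularized objective: the optimal policy then satisfies `r = (alpha + beta) * log pi - beta * log pi_ref + const`, and plugging this reward into a Bradley-Terry preference model gives the loss below. This is a sketch under that assumption. It is not confirmed to be the paper's exact formula, and it does not cover the multi-turn case with recursively defined Q-values. The function and parameter names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def entropy_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, alpha=0.0):
    """Single-turn preference loss with an assumed entropy weight alpha.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy.
    With alpha = 0 this reduces to the standard DPO loss.
    """
    policy_margin = logp_w - logp_l
    ref_margin = ref_logp_w - ref_logp_l
    logit = (alpha + beta) * policy_margin - beta * ref_margin
    return -log_sigmoid(logit)


# Hypothetical sequence log-probabilities: policy equals the reference model.
standard = entropy_dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1, alpha=0.0)
augmented = entropy_dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1, alpha=0.1)
print(f"standard DPO loss:     {standard:.4f}")
print(f"entropy-augmented loss: {augmented:.4f}")
```

In this form the entropy weight only rescales the policy log-ratio relative to the reference log-ratio, which is why it can be added to an existing DPO implementation with a one-line change. For the actual multi-turn loss, the propositions and appendix of the paper are the authoritative source.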
You see this here in reference to PPO-Clip and DPO. Great. So we are focused now on scaling. We are going with test-time compute at inference. Yes, I know some of you might say, hey, TTS is text-to-speech. No: in the current notation, TTS here is test-time scaling. In TTS the agent generates multiple candidate trajectories, and we go with 16 in parallel for any given problem instance, whether bug fixing, a code repo, whatever you have. Now we have only 16 trajectories, but we have to select the optimal trajectory for our coding agent. How to do this?
The authors thought about it, and EntroPO employs a hybrid selector. This is interesting. You know, whenever you are not really sure how to go, you take a little bit of this one and a little bit of that one, and then you say, okay, let's build a hybrid solution. This is what happened here. It is great, and you guessed it: the first component is of course an AI model, and the second is a simple rule-based one, where you say, these are the heuristic rules that I want the system to apply. So nothing specific: we have probabilistic scoring, and we have deterministic rules for efficiency and interpretability, leading (and this is the big question) to empirical improvements. Let's have a look at this. Yes, of course we go with Markov decision processes, and you know exactly why if you have seen my last videos.
Now this particular process for EntroPO is interesting. First we have the trained verifier. If you want, this is the oracle element: a supervised learning model trained on a preference dataset that consists of labeled trajectory pairs. So we go with preferred versus non-preferred for the oracle feedback. Something you know, absolutely no problem, standard procedure.
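The trained verifier described above is a supervised classifier over labeled trajectories. A minimal sketch of how its training loss could be computed, assuming a binary cross-entropy objective where each preferred trajectory gets label 1 and each non-preferred trajectory gets label 0. The scores and the helper name `bce_loss` are hypothetical, and the model that produces the scores is left out.

```python
import math


def bce_loss(scores, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)


# Hypothetical verifier outputs for two labeled pairs:
# (score of preferred trajectory, score of non-preferred trajectory).
pairs = [(0.9, 0.2), (0.6, 0.4)]

# Flatten each pair into two training examples with labels 1 and 0.
scores = [score for pair in pairs for score in pair]
labels = [1, 0] * len(pairs)

loss = bce_loss(scores, labels)
print(f"verifier BCE loss: {loss:.4f}")
```

The loss drops as the verifier pushes preferred trajectories toward 1 and non-preferred ones toward 0, which is what lets its output be read as an estimated likelihood of success at selection time.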
This verifier assigns a probability score, in whatever interval you like, to each of our 16 trajectories, thereby estimating a likelihood of success. For the training of this verifier, yes, of course, for your domain-specific task you have to have a training dataset and you have to train it, but you can use a binary cross-entropy loss. This is simple. This is familiar. No problem at all to implement.
The second part, the deterministic heuristic with the simple rules, is not as easy. There are no learned parameters. Instead we apply a domain-specific catalog of criteria, derived from software engineering principles and from what you want to achieve. In the simplest case they go with a binary indicator that says: we only accept complete trajectories. Well, of course. Second, there is also a binary check that executes a repo-wide regression test on the proposed patch. Great. And then they filter the set with an argmax function. Interestingly, they say they want to favor longer trajectories, under the hypothesis that these reflect more thorough exploration: multiple code inspections, test executions, or other iterative refinements before the patch submission itself. So, okay, they go with this preference for longer trajectories.
Let's have a look at the benchmark, at the evaluation. Here you see test-time scaling on SWE-bench Verified and SWE-bench Lite. Supervised fine-tuning is the baseline here, and you see, okay, not really impressive. Then multi-turn KTO, in yellow: yes, it gets better with this multi-turn training. But it gets really nice if we now add this entropy-regularized term, so that we do not restrict ourselves to a segment of the solution space but try to explore the complete solution space. Look at how much nicer our performance increases, now to almost 58 or 59%. So you see here a 3 to 5.6% performance gain with an entropy regularization term.
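The hybrid selection described above (deterministic checks plus a learned verifier score) can be sketched as follows. How the paper actually orders and combines the rules is not spelled out in the video, so this sketch makes assumptions: it first drops incomplete trajectories and those failing the regression check, then keeps the `keep_longest` longest survivors, and finally takes the argmax of the verifier score. The `Trajectory` fields, the `keep_longest` parameter, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    name: str
    completed: bool          # did the agent finish and submit a patch?
    passes_regression: bool  # did the repo-wide regression test pass?
    num_steps: int           # trajectory length (tool calls, edits, test runs)
    verifier_score: float    # estimated likelihood of success from the verifier


def select_trajectory(candidates: List[Trajectory], keep_longest: int = 4) -> Optional[Trajectory]:
    """Pick one trajectory using rule-based filters, then the verifier score."""
    # Deterministic binary checks: complete and regression-safe only.
    viable = [t for t in candidates if t.completed and t.passes_regression]
    if not viable:
        return None
    # Assumed heuristic: favor longer trajectories as a proxy for thorough exploration.
    longest = sorted(viable, key=lambda t: t.num_steps, reverse=True)[:keep_longest]
    # Argmax over the learned verifier score among the remaining candidates.
    return max(longest, key=lambda t: t.verifier_score)


candidates = [
    Trajectory("a", True, True, 12, 0.70),
    Trajectory("b", True, False, 30, 0.95),  # fails the regression check
    Trajectory("c", False, True, 40, 0.85),  # never completed
    Trajectory("d", True, True, 25, 0.80),
    Trajectory("e", True, True, 5, 0.90),    # short, dropped by the length filter
]

best = select_trajectory(candidates, keep_longest=2)
print(best.name if best else "no viable trajectory")
```

Running the deterministic filters first keeps the selection interpretable: a trajectory with a high verifier score but a failing regression test, like `b` here, can never be chosen.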
Did we expect it? Yes, of course: we are looking for more novel ideas, and we do not stick to the old solution. We explore the solution space. So, in short, EntroPO outperforms DPO by 5 to 10% absolute, especially with test-time compute scaling, due to the higher entropy. The entropy term is crucial. Without it we would not achieve these beautiful results. If you want the numerical table with all the benchmark data, this is for you. You see our two EntroPO variants. Great.
So what is the summary? Easy, that was simple. The paper highlights that entropy regularization in multi-turn preference optimization is key for our agentic systems. Today in this video we looked at agentic systems for coding, with, if you want, theoretical guarantees, enabling scalable test-time compute and diverse exploration in software engineering. Great. And you know what the beauty of this idea is: it is also applicable to mathematics, to theoretical physics, or to any other science experiment. It is nice to explore this idea not just for coding but also for reasoning complexity in general, for visual complexity for example.
But of course, it is not the only study I wanted to show you in this video today. Look here: September 2, 2025, "When Agents go Astray: Course-Correcting SWE Agents with PRMs", which also uses inference-time process reward models. This is by Carnegie Mellon University and IBM Research. A very nice study, have a look at this. This is another idea, a very similar idea.
Inference time, you say: yes, test-time compute, let's go with this. In these precious 10, 20, 30 seconds that we have, you now build up a process reward model to detect and course-correct trajectory-level errors that happen in your agentic system. So depending on how you optimize, and for what particular task you have to optimize test-time compute scaling, this might be another methodology to go with.
And of course, I have another study here, from September 11, 2025. They say: it is nice if you just evaluate the final output, but what about the effectiveness? Is it really the shortest, best solution, or has the system been wandering around for 10 seconds, looking in one direction and ignoring all the rest? So what about the effectiveness of our SWE agents under resource constraints, if you have smaller models, if you do not have 10 or 20 seconds, if you want to do it in 2 seconds? What is the optimization step if you go for effectiveness? Here you have Huawei, The Chinese University of Hong Kong, King's College London, and Queen's University, and they present a new SWE benchmark using a multi-dimensional metric that tries to incorporate effectiveness, which they measure in a very interesting way. Have a look at this third study as well. They say: yes, we want to optimize our test-time compute scaling performance.
So you see where we are currently. Everybody is looking at SWE agents to improve them, to make them faster, and to add entropy regularization terms. And yes, it looks good: at least 5%, maybe up to 10%, performance jump for your next AI coding agent systems. I hope you enjoyed it. See you in my next video.
Original Description
Three new AI research papers to further improve the performance of our AI agents for CODING, Software Engineering. New SWE Benchmarks available on the new architecture, based on new entropy regularization.
All rights w/ authors:
"BUILDING CODING AGENTS VIA ENTROPY-ENHANCED MULTI-TURN PREFERENCE OPTIMIZATION"
Jiahao Yu
Northwestern University
Zelei Cheng
Capital One
Xian Wu
Meta
Xinyu Xing
Northwestern University
arXiv:2509.12434
"When Agents go Astray: Course-Correcting SWE Agents with PRMs"
Shubham Gandhi
Carnegie Mellon University
Jason Tsay
IBM Research
Jatin Ganhotra
IBM Research
Kiran Kate
IBM Research
Yara Rizk
IBM Research
arXiv:2509.02360
"SWE-Effi: Re-Evaluating Software AI Agent System
Effectiveness Under Resource Constraints"
Zhiyu Fan 1, Kirill Vasilevski 2, Dayi Lin 2, Boyuan Chen 2, Yihao Chen 2,
Zhiqing Zhong 3, Jie M. Zhang 4, Pinjia He 3, Ahmed E. Hassan 5
from
1 Huawei
2 Huawei Canada
3 The Chinese University of Hong Kong, Shenzhen
4 King’s College London
5 Queen’s University
arXiv:2509.09853
#ai
#aicoding
#coding
#softwareengineering