GPT-5 Agentic Coding with Claude Code

IndyDevDan · Beginner · 🧠 Large Language Models · 10mo ago

Skills: LLM Engineering 90% · Prompt Craft 80% · Agent Foundations 80% · Tool Use & Function Calling 70%

Key Takeaways

The video demonstrates GPT-5 agentic coding with Claude Code, highlighting its performance and capabilities in comparison to other models like Opus 4.1 and GPT-OSS, and showcases the use of various tools and techniques for agentic coding and prompt engineering.

Full Transcript

It's incredible what you can do with a single prompt. We're running GPT-5, Mini, and Nano right next to Opus, Opus 4.1, Sonnet, Haiku, and the new GPT-OSS 20 billion and 120 billion that are running directly on my M4 Max >> Subagent complete >> MacBook Pro. >> Subagent >> So, you can hear the agents are completing their work >> Agent complete >> in parallel. >> Agent complete >> We have natural language responses coming back to us, and at the end here we're going to get a concrete comparison of how these models performed across the three most important dimensions: performance, speed, and cost. We have a brand new agentic model lineup that we need to break down in this video. We're going to look at a concrete way we can flatten the playing field to really understand how these models perform side by side. All of our Claude Code subagents, running in their own respective nano agents, have finished their work. >> All set and ready for the next step, Dan. >> We have concrete responses and concrete grades for every single model. We are using Claude Code running Opus 4.1 in the LLM-as-a-judge pattern to determine which models are giving us the best results. And you can see something really, really awesome here: total cost on GPT-OSS, zero. In this video we dive into fundamental agentic coding and attempt to answer these questions. Can GPT-5 compete with Opus 4.1? Has useful on-device local LLM performance been achieved? And what's the best way to organize all the compute available to you? If these questions interest you, stick around and let's see how our agents perform on fundamental agentic coding tasks. So, as you've seen already, every single tech YouTuber and content creator has turned the camera on, recorded the screen, and literally just spat the benchmarks of all these brand new models back out to you. Just regurgitated exactly what the post itself tells you. Okay, if you've been with the channel for any amount of time, you know that we don't do that here.
We dive deeper. We actually use this technology and we develop a deep understanding so that we can choose the best tool for the job at hand. There is one trend that matters above all right now. It doesn't matter if we're talking about GPT-5, whatever Anthropic has cooking next, the Cursor CLI, or any other model that's getting put out right now, open source or closed source: there is one thing everyone is focused on. If you've been paying any attention, you know exactly what it is. It is the agent architecture. Why does everyone care so much about agents? First, let's understand how you and I, engineers with our boots on the ground, can get a better, deeper understanding of these models' agentic performance. It's not about a single prompt call anymore. It's about how well your agent chains together multiple tools to accomplish real engineering results on your behalf. What just happened with this prompt? You can see here we have rankings. Surprisingly, we have Claude 3 Haiku outperforming all of our other models. If you look at this, it looks backward. We would expect these models to be on top and these other models to be on the bottom. What's going on? We're evaluating all of our models against each other on a fair playing field where we care about performance, speed, and cost as a collective. These agents are operating in a nano agent, a new MCP server built to create a fair playing field where every one of these models is scaffolded with the same context and prompt, right? Two of the big three. And then we get to see how they truly perform. If we put these models inside Claude Code, I can guarantee you Opus and Sonnet will outperform. But you can see here, for this extremely simple task, "what's the capital?", we get very different results out of the box. Okay. And of course, we're going to scale this up. Let's go ahead and throw a harder, more agentic problem at these models. I'll run that HOP. HOP is higher order prompt.
We'll break that down in just a second. And then we have this prompt that we're passing into this prompt: basic read test. Let's fire this off. We're running a multi-model evaluation system inside of Claude Code. So, we prompt a higher order prompt. We pass in a lower order prompt. We then have our primary agent kick off a slew of respective models running against the nano agent MCP server with a few very specific tools. You can think of this server as a micro Gemini CLI, a micro Codex, a micro Claude Code server that our agents can run against >> Subagent complete >> on a fair playing field. You can see our agents are starting to respond here. >> Subagent complete >> And then they return the results, and of course >> Subagent >> they report back to the primary agent, and then our primary agent reports back to us. So this is the multi-model evaluation workflow: higher order prompt, lower order prompt. We have models that call our nano agent MCP server, and the whole point here is to create a true even playing field. You and I need to evaluate agentic behavior across these models and understand the trade-offs they all have, right? Performance, speed, cost. Okay, all of these matter. All right, you can see our results coming in here. Let's go ahead and dive into this codebase, the nano agent codebase, and understand exactly how this is formatted so that we can benchmark brand new, fresh, state-of-the-art models like GPT-5 against the new Opus 4.1. And, ultra excitingly, these brand new GPT open-source models. OpenAI absolutely cooked on these models. Again, these are running right on my machine right now. Let's open up this codebase and understand the setup. As usual, we have our primary agentic directories. We have our plans. We have our application-specific nano agent here. We have our app docs, AI docs, and most importantly, our commands and our agents. Okay. So, if we open up commands, you can see we have this directory, perf.
So inside of perf you can see we have a HOP, a higher order prompt, and we have LOPs, okay, lower order prompts. >> Subagent complete. >> So this is a powerful prompt orchestration, or prompt engineering, or context engineering technique, whatever you want to call it, I don't care. This is a powerful prompt orchestration technique you can use to reuse top-level prompts and pass in prompts at a lower level. We've covered this on the channel. You know, subscribe, like, do all that good stuff so you don't miss out on these advanced prompt engineering techniques. But you can see here we have a simple grading system, S through F, where S is the best and F is the worst. We have a classic prompt format. We can collapse everything to quickly understand it, right? Use the nano agent MCP server, execute, and then report the results in the response format. And you can see the response format is a simple grading scheme. We are using Claude Code on Opus 4.1 as an LLM-as-a-judge to manage all this for us. And then inside the evaluation details, we're just passing in the LOPs, the lower order prompts, right? The prompts that contain the detail that we want to swap in and out as we evaluate different agentic behavior. All right, if we open up the terminal again, we can see how we're doing here. Looks like we are waiting on GPT-OSS 20 billion and everyone else has completed. Not a ton of surprises there, right? On device does take some time to run. There we go. We just got that completed, and now Claude Code is going to formulate these results into a concrete response for us. It's going to do the evaluation. You can see those tokens streaming in there. Hopefully I still have enough Opus for a great evaluation here. Let's see how that goes. So we have a higher order prompt here, and then here's our eval one that we ran. This is our dummy test. So you can see the structure here, right?
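The higher order prompt and lower order prompt idea described above can be pictured as a reusable template with one slot for the prompt being swapped in. This is only a minimal sketch, not the actual command files from the video: the template wording, the `$lower_order_prompt` placeholder, and the `compose_prompt` helper are all hypothetical names chosen for illustration.

```python
from string import Template

# Hypothetical higher order prompt (HOP): reusable scaffolding with one slot.
HOP_TEMPLATE = Template(
    "Use the nano agent MCP server to run the evaluation below on every model.\n"
    "Grade each model from S (best) to F (worst) on performance, speed, and cost.\n"
    "\n"
    "## Evaluation details\n"
    "$lower_order_prompt\n"
)


def compose_prompt(lower_order_prompt: str) -> str:
    """Insert a lower order prompt (LOP) into the reusable HOP."""
    return HOP_TEMPLATE.substitute(lower_order_prompt=lower_order_prompt.strip())


if __name__ == "__main__":
    # Hypothetical LOP, paraphrasing the basic read test described in the video.
    lop = "Read the README file. Provide exactly the first 10 lines and the last 10 lines."
    print(compose_prompt(lop))
```

Swapping the evaluation then means changing only the LOP string; the grading instructions in the HOP stay the same for every run.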
We're firing off Claude Code subagents and we're having our subagents then fire off our nano agent server. So if we open up GPT-5 nano, right, this is a Claude Code subagent, and all it does is take whatever prompt was passed in. It has access to a single tool, right, our MCP nano agent tool, prompt nano agent, and then we just pass in whatever our parent gives us, whatever the primary agent gives us. So simple enough. You can imagine that we >> All set, ready for your next step >> just continue doing this along every other model we want to run. So you can see here, here's GPT-5 mini, right? But we can easily just swap this out. You can see here there's nano, here's mini, here's five. And then we repeat the same thing for the OpenAI local models, which are absolutely mind-blowing. We have 20 billion and 120 billion running right on my M4 Max MacBook Pro. This is a 128 GB unified memory machine. This thing is absolutely cracked. These models run on the device and they are doing agentic coding work, as you'll see in these results. Let's go ahead and hop back to the results. Pretty excited to share this with you here. But you can see how these prompts are set up, right? These are our agents, and our lower order prompts just detail the exact benchmark that we want run. So on our dummy test, the prompt is: what's the capital of the United States? Respond in your JSON format structure so we can get all the auxiliary metadata coming out of the nano agent MCP server. All right, that's how this is set up. We have another interesting result here. Okay, we have the raw outputs. Here are all of our agents that executed, right? We had nine nano agents firing off. And then we have the respective responses. You can see here we're looking for the first 10 lines and the last 10 lines in the specific prompt format. We can take a look at the result. Let me just quickly show you exactly what that prompt looked like.
So this is a lower order prompt, basic read. If we look at this, we have instructions and then we have variables. And the prompt here is the most important. So we're saying: read the README file. Provide exactly the first 10 lines and the last 10 lines of the file. So this is an agentic task. This is a little more advanced. We're just stepping up the difficulty scale a little bit from just asking a simple question. We just want it in this exact response format, right? So we're testing for instruction following. We're testing for tool use. In a second here, I'll show you the exact tools that our nano agent MCP server can call. Then we fire off all of our agents and we have an expected output. So, we're just sticking to great prompt patterns. We're writing extraordinarily clearly to our agents, both our primary agent and our subagents, and we're being really clear about the flow of communication between our agents, right? We need to make sure that we're orchestrating the communication extraordinarily well. So, that's what we're doing here. But the key here is, right, here's the agentic prompt that's running. Read the README. Give me the first 10 lines and the last 10 lines. Okay. So this is what we're evaluating our models against. So now we can say, for this fundamental agentic coding task, how did our models perform? And we are evaluating on performance, so did it do the job, then speed and cost. And obviously, if the model can't do the job, it doesn't matter if it's fast or how cheap it is, right? All right. So, you can see some rough grades here. If you look at the overall breakdown of this task, we have some rough grades. Okay? It's not all roses when you really flatten the playing field. Take a look at Opus, for instance. We all know that Opus costs, but we don't really realize how much this model costs. It's extraordinarily expensive. Okay.
And you can see here, for some reason, GPT-5 had to churn and churn through its output tokens to figure out how to get this response properly. Okay. So, terrible cost there. I'm actually surprised this chewed up that many tokens. I've seen this run much better, but I have seen some weirdness with GPT-5 through the API. It's taking a crap ton of time. Maybe it's just me. Looks like that model's getting slammed. But then we can see something else really interesting here, right? There are some good fast options, and once again we can see the absolutely cracked benchmark ghosting model Claude 4 Sonnet performing really well. Let's look at that Claude 4 Sonnet grade here. It actually got a D. Wow. Why did it get a D? Performance does not look good. So if we look for that Claude 4 response. Yeah. So you can see what's happening here. It's not giving us that exact response format. Look, it's got this preamble: "Now I'll extract the first blah blah blah." That's not what we asked for. So, we have to dock points. On the other hand, though, we have some really great nano and mini GPT-5 agents performing really well. Okay. So, why are these performing well when we combine an overall grade? It's because, once again, we're not just looking at performance. GPT-5 nano is giving us that exact response. We were looking for the first 10 lines and the last 10. And then we have the last 10 right here. So, very clear, very clean. And we can of course open up the README and see that exactly. So here's the README. Here's the first 10, right? And you can see the last thing there is "clone the repository" in that first 10 lines. So that looks great. And then the last 10, right at the bottom here, we have license MIT. Scroll to the bottom. Look at this, 600 lines. The bottom is of course license MIT. So that's great. The results are not exactly what you would expect because we're accounting for speed, cost, and performance.
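One way to picture how a top performer can still end up with a middling overall grade is to average the three letter grades. The exact rubric used by the judge is not shown in the video, so this is only a sketch under stated assumptions: equal weights for the three dimensions, a hypothetical point scale, and a hypothetical rule that a failing performance grade fails the whole run (following the remark that speed and cost do not matter if the model cannot do the job).

```python
# Hypothetical point scale for the S-through-F grades mentioned in the video.
GRADE_POINTS = {"S": 5, "A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
POINT_GRADES = {points: grade for grade, points in GRADE_POINTS.items()}


def overall_grade(performance: str, speed: str, cost: str) -> str:
    """Combine three letter grades into one (assumed equal weighting)."""
    # Assumption: a model that cannot do the job fails regardless of speed or cost.
    if performance == "F":
        return "F"
    mean = (GRADE_POINTS[performance] + GRADE_POINTS[speed] + GRADE_POINTS[cost]) / 3
    return POINT_GRADES[round(mean)]


if __name__ == "__main__":
    # A model that is perfect on the task but slow and expensive.
    print(overall_grade("S", "D", "F"))
    # A slightly weaker model that is fast and cheap.
    print(overall_grade("A", "S", "S"))
```

Under these assumed weights, the slow and expensive model lands in the middle of the scale while the fast, cheap one rises to the top, which mirrors the shape of the rankings discussed here.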
So you can see here GPT-5 did win on performance. But when you put it all together, right, when you put it together with the speed and cost of GPT-5, the grade gets dragged down. So, very interesting. Let me show you another interesting example where we can push these models to do more: HOP, a file operations test. So not only are our models going to read, they're going to write files here. Let's fire this off and let's understand what this prompt does. So, file operations. This prompt is a little more involved. Here's the prompt that we're passing in to every nano agent. Complete the following tasks. Read .claude/settings.json. Extract all the unique hook names. And then create a JSON file with this structure. It needs to be precise about the JSON structure. Create another file with the content "[model name] was here, successfully completed operations test". And then list the current directory to show the files created. This is a really interesting one. Again, we're just slowly pushing up the difficulty level for our agents to battle on a fair playing field. Okay, I think this is really, really important for truly understanding agentic capability. But if we scroll down to the bottom here, you can see our parallel subagents are getting to work. There's a Claude version, there's a Haiku version, there's GPT-5 mini, there's our OSS, right? Open-source 20 billion running right on my device. We can click into this. Check this out. It has read the file and it output this. Like, check this out. This is so incredible. I've got to say, out of all the models that were released, Opus 4.1, GPT-5, so far I'm most impressed with the open-source models from >> Subagent complete. >> This is viable agentic coding happening on device right now. Okay, so this is >> Subagent complete. >> Very, very, very crazy. Okay, very interesting. Let's close this, open up the terminal, and see our agents working. You can see most of them have already finished.
Our OSS 20 billion is still formulating its output response there. But you can see just one tool use in all of these. There's our OSS responding. >> Subagent complete. >> Okay, fantastic. So we have all of our subagents complete. Now our LLM-as-a-judge, Claude Code on Opus 4.1, is going to put all the results together in a concrete response format based on the higher order prompt. It's going to evaluate them, right? You can see this exact rubric. You can see our performance and our agent responses. So this is all happening thanks to our top-level Claude Code, right? The agent architecture, plus the ability to call the right tool, plus a powerful, cracked model. >> And ready for the next step, Dan. >> All right, so let's see what we have here. Surprising results, right? Take a look at the results. Not what you would expect. These GPT-5 nanos and GPT-5 minis seem to be very good, very fast, accurate instruction-following agents. We even got OSS 20 billion operating very, very quickly. Okay. It's so interesting to see these results. Let's go ahead and take a look at the actual prompt we're looking for. And then we can double check some of this work, right? Let's look at this prompt in detail. You can see all these files. These are agentic tasks calling tools, calling a list of tools to accomplish the result. Let's look at the file operations. Okay, so read this file. We created summary [model name].json, and then we created a "[model] was here" file, and then we listed all the results inside of the agent. Okay.
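For reference, the file operations task can be written out as ordinary code, which makes it easier to check what each agent was expected to produce. This is a sketch under assumptions: it assumes the hook names are the keys of a top-level `"hooks"` object in the settings file, and the output file names and summary keys (`summary_<model>.json`, `signature_<model>.txt`, `hook_names`, `hook_count`) are hypothetical, since the exact JSON structure requested by the prompt is not shown in full.

```python
import json
from pathlib import Path


def run_file_operations(settings_path: Path, out_dir: Path, model_name: str) -> list[str]:
    """Read settings, write a summary and a signature file, then list the output directory."""
    settings = json.loads(settings_path.read_text())

    # Assumption: hook names are the keys of a top-level "hooks" object.
    hook_names = sorted(set(settings.get("hooks", {})))

    # Hypothetical summary structure; the real prompt prescribes its own format.
    summary = {"model": model_name, "hook_names": hook_names, "hook_count": len(hook_names)}
    (out_dir / f"summary_{model_name}.json").write_text(json.dumps(summary, indent=2))

    # Signature file, modeled on the "[model name] was here" instruction.
    (out_dir / f"signature_{model_name}.txt").write_text(
        f"{model_name} was here, successfully completed operations test\n"
    )

    # Final step of the task: list the directory to show the files created.
    return sorted(path.name for path in out_dir.iterdir())
```

An agent only passes the task if all three steps happen: the read, both writes, and the final listing. That is why a missing output file is enough to sink a grade.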
So let's take one agent, right? Let's take our top and our worst performer. So GPT-5 nano, how did this perform? Let's look for this. We should have this here now. So we have the model name, GPT-5 nano. Here are all the Claude Code hook names: pre, post, notification, stop, sub, pre, blah blah blah. So, perfect format there. What else did we ask for? Model signature name. Let's open this up, and again let's look for GPT-5 nano, and you can see "GPT-5 nano was here, successfully completed file operations." Okay, exactly what we were looking for. And then list directory. So, very precise, right? It read a file, it output a JSON structure, and then it wrote two files. It wrote these two files. What happened to our best model, which is by the way costing me quite a bit every time I run this? What happened here? Let's take a look. So let's start with summary Opus. You can see here the results look great. Let's look for signature Opus. And once again, it accomplished the task perfectly. Okay, so if we scroll here, our Opus 4 got an S, right? So it performed the task perfectly. But of course, we can see speed and cost brought the grade down quite a bit. And I was actually supposed to look for Opus 4.1. Let's actually look for Opus 4.1. So what happened with Opus 4.1? Opus 4.1 does not have an output file. Okay, so that does it. For some reason, Opus 4.1 just does not have an output file. Maybe this was my fault. Let me look for the 4.1 agent. Wow. No. It just doesn't have an output file. Interesting. So yeah, that explains why that does not work and why it got such a low grade. But the great part about this architecture, and maybe this is a perfect thing to showcase, is that we can fire up a Claude instance right here in the terminal. These are all individual subagents, right? It's really important to atomize your units of compute so that you can test them and scale them up.
I have a base-level MCP server that's getting called out of a Claude Code subagent. I'm explicitly saying inside these prompts: use this MCP server. I'm passing the parameters. There is zero confusion about what's going on here. So this agent will run this and then it's going to report the results as is. So we can go ahead and just run this against our 4.1. There we go. So there's our nano agent. And now we pass in the prompt to execute. Again, I'm just going to be really blunt and pass in the exact prompt we were looking for. This prompt here. And while we're doing this, in the background, we can just keep churning away. So, let's go to the next task. So, HOP. And then we can pass in our next evaluation. So, there's that. Passing that lower order prompt into the higher order prompt. This is a great way to increase your velocity. Let's just copy this prompt as is. And let's prompt our agent. Okay. So, all this is going to get passed directly in. What you'll see here is a Claude Code subagent spin up this exact series of commands that we want our nano agent to run. And all we're doing here, you can see, there's that agentic prompt. There's our tool call: nano agent, prompt nano agent. And the signature for this is quite simple. We can just search for this, def nano agent. And in our nano agent Python file, you can see exactly what this looks like. We have a rich description, right? A rich function comment for any callers of this tool, right?
Because this is an MCP server tool that's getting called, and >> Subagent complete >> and this just passed off the work to a concrete lower-level agent. >> All set and ready for your next steps, Dan. >> You can see here we now have that output and we now have the correct responses. If we open this up, there we go, you can see Opus 4.1 producing the results as we wanted. Maybe it got overwritten, who knows what happened. >> Subagent complete >> But the important piece here is that you want that granularity of your agents. At any point in time now I can just spin up an on-device >> Subagent complete >> GPT-OSS 20 billion parameter model, and just for fun, let's do that, right? So if I hit up here and I switch out the agent, let's do the 120. So we'll have the 120 run this exact same prompt again. It did complete its work, so I'm just going to delete it here so we can rerun this. Okay, and we're going to fire that off. It's going to have the nano agent perform these tasks. So this is ultra powerful. We're composing powerful units of compute, right? We're using the great Claude Code agent architecture with powerful capabilities to call tools in parallel, to call subagents, and to call our specific subagents. We can open up our nano agent GPT-OSS 120 billion, and you can see exactly what's happening here. Very, very powerful. This is agentic coding running on my device. This is mind-blowing. Okay, the performance probably has some work to do, but if we attempt to... Okay, yeah, we've got to fix the scrolling issues here, guys. We can't scroll at all. I know I'm not the only one that experiences issues like this, but >> Subagent complete. >> Anyway, we'll let this complete. GPT-OSS models for these, again, simple.
I realize these are simple agentic coding tasks, but that doesn't change the fact that we have proof here that these GPT-OSS models running on your device can do easy, simple, and maybe even some moderate-difficulty agentic work for you. Okay. And this is running on a blank instance, right? A blank, equal-playing-field agentic system. And maybe that's a good place to go now. How am I actually running this? What does that MCP server look like exactly? I only have a single tool call. So what's going on underneath the hood? You can hear my device. My Mac is working now to accomplish this task. And by the way, how many models are we running at the same time here? For a moment there, we were running like 10 models at the same time. Some hitting the cloud, three of them on device. You can see the time consumption there on GPT-OSS 20 billion. It definitely consumed some time. Beautiful. Look at this. The 120 completed this task for us. We should have those files back now. There it is. 120. >> All set and ready for the next challenge. >> Beautiful, right? 120B completed that task. Here it is. All of the hooks, once again, recreated. There's the signature: successfully completed file operations. This is agentic coding. Our agents are operating our device. They're doing engineering work. And for the first time, we have an on-device model that is completing work on our behalf. When this ran, aside from the Claude Code pieces of it, all that agentic work happened on my device, which is insanely incredible. I'm running on Ollama, by the way. Let me break down exactly how this works. What is this nano agent? The nano agent is quite simple. If we open up main here and the nano agent, I have a simple MCP server, okay, and it has a single tool: execute an autonomous agent with a natural language task. So this is just like Claude Code's Task tool, right?
When it spins up generic agents or specific subagents, it's kind of the same deal, except here we're just passing in a prompt, and then our prompt gets interpreted by our nano agent and it runs a specific tool. So you can see here we just have that one tool, prompt nano agent, and prompt nano agent takes an agentic prompt, a model, and a model provider. That's it. The agentic prompt is where all the magic happens. Inside the nano agent, we are actually running another agent. If you remember the flow here, this is what's happening: our user is calling a prompt. Our primary agent is kicking off all of these different models and model providers against our nano agent MCP server. And then, if we had some space here, we would see that our nano agent itself consists of a few tools. So let's go ahead and understand the tools our nano agent has. Okay. So, execute nano agent. At some point here we're going to see our tool selection, right here. And what do we have here? Get nano agent tools. Check this out. This is all it has. Very simple, very concise, right, a subset of modern agentic coding tools. But you can see it's just read, write, list, get file, edit file, just as you would imagine. And you can see all those tools here. So, how does this all run? We are running on the OpenAI Agents SDK. So, if I look at create agent here, we should get at some point Agent, which is coming right out of the OpenAI Agents SDK. We can look at the pyproject, right? And this is it. So this is our agent scaffolding. They have some great tooling here. I highly recommend it if you want to experiment and build out some basic agents to really understand how the most important architecture of the year, and probably of the next few years, is built, how to use it, and how to build your own agents. This is a great way to get started quickly. The OpenAI Agents SDK is fantastic.
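The small tool set described above (read, write, list, get file, edit file) can be pictured as a handful of plain functions plus a lookup table that an agent loop dispatches into. This sketch deliberately uses only the Python standard library and is not the nano agent source: the function names, parameters, and the `call_tool` dispatcher are hypothetical, and the real server is described as building its agent on the OpenAI Agents SDK rather than a hand-rolled registry.

```python
from pathlib import Path


def read_file(path: str) -> str:
    """Return the full text of a file."""
    return Path(path).read_text()


def write_file(path: str, content: str) -> str:
    """Create or overwrite a file and report what was written."""
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"


def list_directory(path: str = ".") -> list[str]:
    """Return the sorted entry names in a directory."""
    return sorted(entry.name for entry in Path(path).iterdir())


def get_file_info(path: str) -> dict:
    """Return basic metadata about a file or directory."""
    target = Path(path)
    return {"size_bytes": target.stat().st_size, "is_dir": target.is_dir()}


def edit_file(path: str, old: str, new: str) -> str:
    """Replace the first occurrence of `old` with `new` in a file."""
    text = Path(path).read_text()
    if old not in text:
        raise ValueError(f"text to replace not found in {path}")
    Path(path).write_text(text.replace(old, new, 1))
    return f"edited {path}"


# Hypothetical registry the agent loop would dispatch into by tool name.
TOOLS = {
    tool.__name__: tool
    for tool in (read_file, write_file, list_directory, get_file_info, edit_file)
}


def call_tool(name: str, **arguments):
    """Run one tool call requested by the model."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)
```

Keeping the tool set this small is part of what makes the comparison fair: every model gets the same five capabilities and nothing else.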
This allows us to create a very fair and balanced way to compare models, right? Because in our provider config, we can specify: is this an Anthropic model? Is this an Ollama model? And then we can just update the endpoint. So it's a great way to test on a fair playing field. These all call the respective tools and get work done on our behalf. So that's basically the nano agent. It's a generic prompt that calls one of several tools, and these tools just do some operation. And the incredible part here is that I think the playing field for agents has really opened up this week. Literally just this week, we have a ton of new options for agentic coding and for getting work done both on and off device. To be ultra clear here, it isn't just the model that matters. There's a lot of work that goes into something like Claude Code to make it the best, most efficient, most scalable, most consistent agent. Consistency is really important. So, I think that with all these models, you still want a top-level, state-of-the-art agent, and that is clearly Claude Code. I'm by no means substituting Claude Code, because there is no substitute. What we are doing here is understanding agentic model capability and taking some time to understand when and how we can start delegating some of our work from Claude Code to alternatives that are more specialized for certain tasks. Of course, there are three things that we focus on: performance, speed, and cost. You now have more options than ever to really decide what you want to trade off. Okay, a lot of the time, when you're really working and churning, the only thing that matters is performance. Okay, now you can do some really powerful things with this new Opus 4.1. GPT-5 is, I think, comparable. I'm running into issues with this model at scale. I'm more impressed and surprised with the mini, nano, and the OSS models than I am with GPT-5. Not to say that GPT-5 isn't a great model.
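The provider config idea, where switching between a cloud model and a local model only changes an endpoint, can be sketched as a small lookup table. Everything here is an assumption for illustration rather than the nano agent's actual config: the `ProviderConfig` and `client_settings` names are hypothetical, and the base URLs are the commonly documented OpenAI-compatible endpoints at the time of writing (Ollama's default local port is 11434), which may differ in your setup.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str
    api_key_env: Optional[str]  # None means no real key is needed (local model)


# Assumed endpoints; verify against each provider's documentation.
PROVIDERS = {
    "openai": ProviderConfig("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "anthropic": ProviderConfig("https://api.anthropic.com/v1/", "ANTHROPIC_API_KEY"),
    "ollama": ProviderConfig("http://localhost:11434/v1", None),
}


def client_settings(provider: str, model: str) -> dict:
    """Build the settings an OpenAI-compatible client would need for one model."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    config = PROVIDERS[provider]
    # Local servers typically ignore the key, so a placeholder string is used.
    api_key = os.environ.get(config.api_key_env, "") if config.api_key_env else "ollama"
    return {"base_url": config.base_url, "api_key": api_key, "model": model}


if __name__ == "__main__":
    # Hypothetical model tag for a locally served open-weight model.
    print(client_settings("ollama", "gpt-oss:20b"))
```

Because the agent scaffolding, tools, and prompt stay fixed, the model and endpoint are the only variables left in the comparison.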
I just need to spend some more time with it. I need to understand it at a deeper level. You can see here, in this agentic prompt, GPT-5 actually performed pretty well, right? So once again, very interestingly, we have nano and mini in first place when we combine everything into one response. We're looking for the best overall grade across performance, speed, and cost. But as you can see here, the true colors are starting to show a little bit: Opus 4.1, Opus 4, Sonnet 4, S-tier performance. And across the tests that I have done, I just got some time to set up this nano agent. With the few tests that I've had time to dig into to understand these models at a deeper, fundamental, unbiased level, it's pretty clear and pretty consistent that these are still your top models, but we have new viable agentic coding models. Really happy to see this. Let's open up LOP 4, lower order prompt eval 4. What is Claude Code, the primary agent, asking our subagent to accomplish with the nano agent MCP server, which itself is a little micro agent? Okay, it's asking this: perform the following code engineering tasks. Read the file. So we're reading our constants file, right here. Okay, a bunch of constants for this codebase. Analyze the code structure, and then we're going to do this: create a new Python file called analysis [model name], replacing model name with your actual model name, that contains a docstring analysis, then a function, get constants report, and then a comment at the bottom of the file. So again, here we're just testing the agentic behavior of these models with these intricate, step-by-step agentic prompts, right? We're out of the world of just chatting. We're out of the world of prompt chaining in specific workflows. We want arbitrarily long sequences of engineering tasks completed by our models. Okay, a single prompt eval is not enough anymore.
It's about the string of tasks your system can accomplish. We've scaled far beyond the prompt, far beyond the prompt chain. We're now at this agent architecture that interacts with its environment over and over and over. Okay. And then we have: create another file here, enhanced constants. Add one useful constant. And you can see exactly what it's saying to add. Include all original constants plus your addition. All right. Return a summary. So let's look at one of our models' performance here. Let's look at a perfect example, right? We have some S tiers here. Let's go ahead and take a look at Sonnet 4. If we just look for all these files, we should see analysis Sonnet 4. There it is. So, analysis [model name]. And check this out. It's a Python file with its analysis at the top. And then we have the constants report. So check out this great report, right, as a function in this file. And then at the bottom we said "analysis completed by". Okay. So, it's following instructions extraordinarily well. And then we have one more file here. There it is. Enhanced constants Claude 4. And check this out. It has the exact format of our constants. Go side by side. If we save both of these to get the formatting the same, they have the exact format. 5 mini, 5 mini. There, it's literally just replacing the constants, right? So, it wrote this new file. And if we scroll to the bottom here, we will have the enhanced signature as specified by the prompt: enhanced by XYZ. Okay. We're not surprised at all, right? We know that Sonnet, we know that the 4 series is absolutely cracked, but we can also see we have a bunch of other models here performing fairly well. A-tier performance, OSS 20B. Let's see how it did. I think you get the point. I'm just really curious about this, too. This is the first time I'm running these tests. We're looking at these live together here. So, let me look at that first file. Analysis 120B.
Okay, so there's an analysis. There's a report. And there it is: analysis completed by OSS 120B. Fantastic. And then we have that enhanced constants. Okay, go ahead and open this up. Enhanced constants. You know, let's look at this side by side. And you can see GPT-OSS 120 billion. It looks like it's placed the file as is, side by side, like, perfectly, right? How many lines do we have here? We have 114 in the original, and OSS added that model signature as prescribed. This looks incredible. It accomplished the task. The performance is there. Of course, with the OSS models, um, running on device, it did take quite a bit more time than these other models, but again, look at the total cost; it's running on my device. Uh, this is just so incredible, right? I know I'm glazing hard on these models, but I think this is a big, big deal right now. And I don't think many engineers understand this because, you know, what do most of us do? We just, you know, look at the blog post. We just scroll, scroll, scroll. We look at the benchmarks and then we complain about how it's only 2 or 3% higher, and you don't really understand what it means. Okay? Like, most engineers don't actually understand what, you know, a 2% increase in agentic coding really looks and feels like, okay, over the previous generation, right? It's actually massive, okay? And with every bump up on all these different benchmarks, you get something incredible that you can't see by just looking at the benchmark, right? You need to feel these tools. You need to run some type of, you know, specialized workflow. You just spend time. You don't even have to run evals, right? If you really want to understand the model, definitely run evals. But you don't even need to go that far. You just need to spend time running these models against real problems. Okay? Don't trust any individual benchmark. It's their job to tell you that they're doing a great job. Okay? And I love every one of these companies as much as the next.
I'm a big fan of every single big AI, you know, gen AI company. But at the end of the day, you need to crack open the hood of all these models and say, where is the true value? Because I can guarantee you, right, there are some, uh, cheap, fast, high-performing models that you can tap into now. You just need to actually know that you can, right? You need to know that it's even possible. This has been a breakthrough week. It's very clear to me that there is more compute than ever to tap into. I'm able to understand and move and work through these innovations because I understand the industry at a fundamental level. Right? Everything we do is based off just one concept: the big three, context, model, and prompt. Everything is based off these. If you understand these, you can build evals, you can build benchmarks, you can build agents, because they're all just scaffolded on top of these, right? So, if you're interested, you know, check out Principled AI Coding. This is my hand-crafted course that I built to help you understand how to excel and how to stay relevant with today's tools and tomorrow's. Thousands of engineers have taken this. They've gotten serious value out of this. They've learned how to approach the industry. There's going to be a big bonus for anyone that takes Principled AI Coding: you'll get a discount on the next upcoming agentic coding course. Okay? You know, this course has been out for, you know, about half a year now. All the ideas are still relevant. You can use Claude Code or whatever agentic coding tool you want inside this course, if you're interested. But you want to be prepared for what's coming next, okay? And what's coming next is serious agentic coding. It's beyond that. It's agentic engineering, right? You see this here, right? Um, I have local models performing agentic coding tasks on our behalf, right? They're accomplishing things that previously only Claude Code could do. Okay, things are changing. The landscape is shifting.
What doesn't change is the fundamental principles of AI coding. Okay, because agentic coding is just a superset of AI coding. Okay, I've said this a hundred times. I'm going to keep saying it for everyone, right? If you made it to this video, if you made it to the end, you're one of the lucky few that has an opportunity to really excel and push yourself forward, right? Invest in yourself to understand this concept. We talk about key ideas like closing the loop, you know, programmatic AI and agentic coding. And this is all preamble. It's all setup. It's all the system prompt for the phase 2 agentic coding course I'm going to be releasing in September. I'm really excited. I work on this thing 24/7 when I'm not, uh, you know, eating, sleeping, and resting. Okay. So, we have more compute than ever. It's not about the prompt anymore. It's not about the individual model anymore. It's about what the model can do in long chains of tool calls, right? The true value proposition of models is being exposed. It's real work, end to end. And the thing to keep an eye on is, do you know how to trade off performance, cost, and speed when the time is right? Because I can guarantee you, you don't always need Opus 4. Okay? You might be able to settle for GPT-5, which is much cheaper, by the way, than Opus 4, right? Or you might be able to go further, right? You can just use 5 Mini. Maybe you need to scale to Sonnet 4 for that task. Fine. But maybe, you know, you can build your own specialized small agent, right? You can build off of the nano agent codebase that is going to be available to you. Link in the description. I'm going to clean this up and make sure it's available to you so that you can, you know, understand agentic coding at a fundamental level. But, you know, maybe you can go even further beyond and use a small on-device model. These are only going to get better.
So you want to have the infrastructure in place and the tooling in place to understand the capabilities, so when it's ready you can hop on it. I can tell you right now, I'm going to be investing more into, you know, these models across the board, so I know what tasks can be accomplished by what model, so I can make the right tradeoff. Okay. Engineering is all about tradeoffs: performance, speed, cost. At different times of the day, based on the task you're working on, different things matter. Okay, so super long one. Thanks for sticking with me here. Um, you know where to find me every single Monday. Stay focused and keep building.
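The performance, speed, and cost tradeoff described in the transcript can be sketched as a small weighted ranking. This is only an illustration, not the grading logic used in the video: the letter-to-point scale, the `ModelResult` type, the weights, and the three placeholder entries are all assumptions, and the grades are made up purely to exercise the functions.

```python
from dataclasses import dataclass

# Letter grades mapped to points. The scale itself is an assumption.
GRADE_POINTS = {"S": 5, "A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


@dataclass
class ModelResult:
    name: str         # placeholder label, not a real measured model
    performance: str  # letter grade, e.g. from an LLM-as-a-judge run
    speed: str
    cost: str


def overall_score(result, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three grades on the 0-5 point scale."""
    w_perf, w_speed, w_cost = weights
    total = (
        w_perf * GRADE_POINTS[result.performance]
        + w_speed * GRADE_POINTS[result.speed]
        + w_cost * GRADE_POINTS[result.cost]
    )
    return total / (w_perf + w_speed + w_cost)


def rank(results, weights=(1.0, 1.0, 1.0)):
    """Return model names sorted from best to worst overall score."""
    ordered = sorted(results, key=lambda r: overall_score(r, weights), reverse=True)
    return [r.name for r in ordered]


# Made-up grades, for illustration only.
results = [
    ModelResult("big-cloud-model", performance="S", speed="B", cost="D"),
    ModelResult("small-cloud-model", performance="A", speed="A", cost="A"),
    ModelResult("local-model", performance="A", speed="D", cost="S"),
]

print(rank(results))                           # equal weights
print(rank(results, weights=(3.0, 1.0, 1.0)))  # performance weighted 3x
```

With equal weights the cheap local placeholder outranks the expensive one; weighting performance three times as heavily swaps them. The ordering depends on what matters for the task at hand, so the weights are worth choosing per task rather than fixing once.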

Original Description

Did GPT-5 prove there's a wall? I don't care at all and neither should you. There are more important things to focus on like agentic coding, gpt-5, gpt-oss, opus 4.1, and COMPOSABLE compute.

GPT-5 and Opus 4.1 just dropped with GPT-OSS models running on-device - but which one actually DOMINATES agentic coding? 🤯 These aren't your typical benchmark regurgitations - we're testing real agentic workflows with CONCRETE engineering tasks.

🎥 VIDEO REFERENCES:
- Nano Agent MCP Server: https://github.com/disler/nano-agent
- PRINCIPLED AI CODING: https://agenticengineer.com/principled-ai-coding?y=tcZ3W8QYirQ
- OPENAI AGENTS PYTHON: https://openai.github.io/openai-agents-python/
- GPT-5 POST: https://openai.com/index/introducing-gpt-5/
- CLAUDE OPUS 4.1 POST: https://www.anthropic.com/news/claude-opus-4-1
- CURSOR CLI: https://cursor.com/cli

🔥 We're beyond single prompts now - it's ALL about agent architecture! Watch as we put GPT-5, Opus 4.1, and the groundbreaking GPT-OSS models (20B and 120B running ON MY DEVICE) through fundamental agentic coding tests using our custom Nano Agent MCP server. This isn't just another model comparison - we're evaluating the three things that actually matter: performance, speed, and cost. Using Claude Code as our LLM-as-a-judge aka Agent-as-a-judge, we create a completely fair playing field where every model gets the same context, tools, and prompts.

⚡ Key Insights discoveries in this video:
- GPT-OSS models performing legitimate agentic coding LOCALLY on an M4 Max MacBook Pro
- Why GPT-5 Mini and Nano are crushing expectations in real-world scenarios
- How Opus 4.1 performs when cost and speed ACTUALLY matter
- The game-changing potential of on-device AI agents for coding tasks
- Legit comparison between GPT-5 and Opus 4.1 (Opus still leads but gpt-5 is priced HYPER COMPETITIVELY)

🛠️ We dive deep into fundamental agentic coding patterns using our Nano Agent architecture - a micro-agent system that levels the playing field. From s

This video shows how to evaluate GPT-5, Opus 4.1, and the on-device GPT-OSS models on agentic coding tasks, using Claude Code subagents and a Nano Agent MCP server, and how to grade each model on performance, speed, and cost.

Key Takeaways

Run the GPT-OSS 20 billion and 120 billion OpenAI models locally on device
Configure the nano agent MCP server and its prompt nano agent tool
Evaluate models against agentic prompts that require clear instruction following and tool use
Use Claude Code as a multi-model evaluation system
Pass a higher order prompt and a lower order prompt to the nano agent MCP server
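The instruction-following part of the evaluation can be partly checked with plain code. The sketch below is a deterministic structural check, not the LLM-as-a-judge grading used in the video. It tests a generated analysis file for the three things the prompt in the transcript asks for: a docstring, a report function, and a comment at the bottom. The function name `get_constants_report` is inferred from the spoken transcript, and `check_analysis_file` and the sample file content are hypothetical.

```python
import ast


def check_analysis_file(source: str) -> dict:
    """Run three structural checks on the text of a generated analysis file."""
    tree = ast.parse(source)

    # 1. The module starts with a docstring (the written analysis).
    has_docstring = ast.get_docstring(tree) is not None

    # 2. A top-level function with the expected (assumed) name exists.
    has_function = any(
        isinstance(node, ast.FunctionDef) and node.name == "get_constants_report"
        for node in tree.body
    )

    # 3. The last non-blank line is a comment (the model's sign-off).
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    has_trailing_comment = bool(lines) and lines[-1].startswith("#")

    return {
        "docstring": has_docstring,
        "function": has_function,
        "trailing_comment": has_trailing_comment,
    }


# Hypothetical file content standing in for one model's output.
sample = '''"""Analysis of the constants module."""


def get_constants_report():
    return "report"


# Analysis completed by example-model
'''

print(check_analysis_file(sample))
```

Checks like these only confirm that the requested structure is present. Whether the analysis itself is any good still needs a judge, human or model.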

💡 Agentic coding is a superset of AI coding. GPT-5 is a viable and much cheaper alternative to Opus 4 for many agentic coding tasks, although the Opus and Sonnet 4 models still lead on raw performance.
