Claude Code: Anthropic's CLI Agent
Skills:
Agentic Coding90%AI Pair Programming80%AI-Assisted Code Review80%Tool Use & Function Calling70%
Key Takeaways
The video discusses Claude Code, Anthropic's CLI agent, which provides access to files and allows bash commands, and its applications in AI coding, including code generation, review, and automation.
Full Transcript
Hey everyone, welcome to the latest and space podcast. This is Allesio, partner and CTO at Decible, and I'm joined by my co-host Swix, founder of Small AI. Hey, and today we're in the studio with Cat Woo and Boris Journey. Welcome. Thanks for having us. Thank you. Uh Cat, you and I know each other from before. I just realized Dagster as well and then Index Ventures and now Enthrovic. Exactly. Um, it's so cool to see like a friend that you know from before like now working in Enthopic and like shipping really cool stuff. And Boris, you were a celebrity cuz like we were just having you outside just getting coffee and people recognized you from your video. Oh wow. Right. That's new. Wasn't that wasn't that neat? Um, yeah. I definitely I had that experience like once or twice in the last few weeks. Yeah, it was surprising. Yeah. Well, thank you for making the time. You're here to talk we're here to talk about cloud code. Most people probably have heard of it. We think like you know quite a few people have tried it but let's get a crisp upfront definition like what is cloud code? Yeah so cloud code is cloud in the terminal. Um so you know cloud has a bunch of different interfaces. There's desktop, there's web and yeah cloud code it runs in your terminal because it runs in the terminal. It has access to a bunch of stuff that you just don't get if you're running on the web or on desktop or whatever. So it can run bash commands. it can see all of the files in the current directory and it does all that agentically and yeah I guess it maybe it comes back to like the maybe the question under the question is like you know where did this idea come from and yeah part of it was we just want to learn how claude um we want to learn how people use agents we doing this with the CLI form factor because coding is kind of a natural place where people use agents today and you know there's kind of product market fit for this thing but yeah it's just sort of this crazy research project and obviously it's kind of bare bones and simple. Um but yeah, it's like a agent in your terminal. That's how the best stuff starts. Yeah. How did did it start? Did you have a master plan to build cloud code or There's no master plan. Uh when I joined Anthropic, I was experimenting with different ways to use the model kind of in different places. And the way I was doing that was through the public API, the same API that everyone else has access to. And one of the really weird experiments was this claw that runs in a terminal. And I was using it for kind of weird stuff. I was using it to like look at what music I was listening to and react to that and then you know like screenshot my you know video player and explain what's happening there and things like this. And this was like kind of a pretty quick thing to build and it was pretty fun to play around with. And then at some point I gave it access to the terminal and the ability to code and suddenly it just felt very useful like I was using this thing every day. It kind of expanded from there. We gave the core team access and they all started using it every day which was pretty surprising. Uh and then we gave all the engineers and researchers at anthropic access and pretty soon everyone was using it every day and I remember we had this DAU chart for internal users and I was just watching it and it it was vertical like for days and we're like all right there's something here. We got to give this to external people so everyone else can try this too. Yeah. And yeah, that's where it came from. And were you also working with Boris already or did this come out and then it started growing and then you're like, "Okay, we need to maybe make this a team so to speak." Yeah, the original team was Boris, Sid, and Ben. And over time, as more people were adopting the tool, we felt like, okay, we really have to invest in supporting it because all our researchers are using it and we this is like our one lever to make them really productive. And so at that point I was using cloud code to build some visualizations. I was analyzing a bunch of data and sometimes it's super useful to like spin up a streamllet and like see all the aggregate stats at once and cloud code made it really really easy to do. So I think I sent Boris like a bunch of feedback and at some point Boris was like do you want to just work on this? And so that's how it happened. It was actually a little like it was more than that on my side. you were sending all this feedback and at the same time we were looking for a PM and we were like looking at a few people and then I remember telling the manager like hey I want cat I'm sure people are curious what's the process within Entropic to like graduate one of these projects like so you have kind of like the a lot of growth then you get a PM when did you decide okay we should like it's ready to be opened up generally at anthropic we have this product principle of do the simple thing first and I think that the way we build product is really based on that principle. So you kind of staff things as little as you can and keep things as scrappy as you can because the constraints are actually pretty helpful. And for this case, we wanted to see some signs of product market fit before we scaled it. Yeah, I imagine so like we're putting out the MCP episode this week and I I imagine MCP also now has a team around it in much the same way. It is now very much officially like sort of like a an anthropic product. So I'm kind of curious for for Cat like how do you view PMing something like this? It's like what is I guess you're like sort of grooming the road map. You're listening to to to users and the velocity is something I've never seen coming out of anthropic. I think I am with a pretty light touch. Um I think Boris and the team are like extremely strong product thinkers and for the vast majority of the features on our road map, it's actually just like people building the thing that they wish that the product had. So very little actually is tops down. I feel like I'm mainly there to like clear the path if anything gets in the way and just make sure that we're all good to go from like a legal, marketing, etc. perspective. Yeah. And then I think like in terms of very broad road map or like long-term road map, um I think the whole team comes together and just thinks about okay, what do we think models will be really good at in 3 months and like let's just make sure that what we're building is really compatible with like the future of what models are capable of. I'd be interested to double click on this. what will models be good at in 3 months? Cuz I think that's something that people always say to think about when building AI products, but nobody knows how to think about it because it's everyone's just like it's generically getting better all the time. We're getting AGI soon, so don't bother, you know, like how do you calibrate 3 months of progress? I think if you look back historically, we tend to ship models every couple of months or so. So 3 months is just like an arbitrary number that I picked. I think the direction that we want our models to go in is being able to accomplish more and more complex tasks with as much autonomy as possible. And so this includes things like making sure that the models are able to explore and find the right information that they need to accomplish a task. Making sure that models are thorough in accomplishing every aspect of a task. Making sure the models can like compose different tools together effectively. Yeah, these are directions we care about. Yeah, I guess it coming back to code, this kind of approach affected the way that we built code also because we know that if we want some product that has like very broad product market fit today, we would build, you know, a cursor or a windsurf or something like this. Like these are awesome products that so many people use every day. I use them. Um, that's not the product that we want to build. We want to build something that's kind of much earlier on that curve and something that will maybe be a big product, you know, a year from now or, you know, however much time from now as the model improves. And that's why code runs in a terminal. It's a lot more bare bones. You have raw access to the model because we didn't spend time building all this kind of nice UI and scaffolding on top of it. When it comes to like the harness so to speak and things you want to put around it, there's one the maybe prompt optimization. So obviously I use cursor every day. There's a lot going on in cursor that is beyond my prompt for like optimization and whatnot, but I know you recently released like you know compacting context features and all that. How do you decide how thick it needs to be on top of the CLI so that's kind of the share interface and at what point are you deciding between okay this should be a part of clock code versus this is just something for the IDE people to figure out for example. Yeah there's kind of three layers at which we can build something. So the you know being a AI company the most natural way to build anything is to just build it into the model and have the model do the behavior. The next layer is probably scaffolding on top. So as like quad code itself and then the layer after that is using cloud code as a tool in a broader workflow. So to compose stuff in you know so for example a lot of people use code with you know t-mox for example to manage a bunch of windows and a bunch of sessions happening in parallel. we don't need to build all of that in. Um, compact, it's sort of this thing that kind of has to live in the middle because it's something that we want to work when you use code. You shouldn't have to pull in extra tools on top of it. And rewriting memory in this way isn't something the model can do today. So, you have to use a tool for it. And so, it it kind of has to within that that middle area. We tried a bunch of different options for compacting, you know, like rewriting uh old tool calls and uh truncating old messages and not new messages. And then in the end we actually just did the simplest thing which is ask quad to summarize the you know the previous messages and just return that and that's it. And it's funny with when the model is so good the simple thing usually works. You don't have to overate. Yeah we do that for cloud plays Pokemon too which is kind of interesting to see that pattern reemerging. And then you have the claw MD file for the more userdriven memories so to speak. It's like kind of like the equivalent of maybe cursor rules I would say. Yeah. And quadmd it's another example of this idea of you know do the simple thing first. We we had all these crazy ideas about like memory architectures and you know there's so much literature about this. There's so many different external products about this and we wanted to be inspired by all the stuff but in the end the thing we did is ship the simplest thing which is you know it's a file that has some stuff and it's auto read into context and there's now a few versions of this file. You can put it in the root or you can put it in child directories or you can put in your home directory and we'll we'll read all of these in kind of different ways but yeah simplest thing that could work. I'm sure you're familiar with ader which is another u thing that people in our discord loved and then when cloud code came out the same people love cloud code. Um any thoughts on like you know inspiration that you took from it things you did differently kind of like maybe design principle in which you went a different way. Yeah, this is uh actually the moment I got AGI pled is related to this. Okay, so maybe I can tell that story. Yeah. Um so Ader inspired this internal tool that we used to have at anthropic called Clyde. So Clyde is like you know CLI quad and that's the predecessor to quad code. It's kind of this research tool that's uh you know it's like written using Python. It takes like a minute to start up. It's like very much written by researchers. It's not a polished product. And when I first joined Enthropic, I was putting up my first poll request. You know, I hand wrote this poll request cuz I didn't know any better. And my boot camp buddy at the time, Adam Wolf, was like, you know, actually, maybe instead of handwriting it, just ask Clyde to write it. And I was like, okay, I guess so. It's a AI lab. Maybe maybe there's some, you know, capability I didn't know about. And so I start up this like terminal tool. And it took like a minute to start up. And I asked Quad, hey, you know, here's the description. Can you make a PR for me? And after a few minutes of chucking along, it made a PR and it worked. And I was just blown away cuz I had no idea. I had just no clue that there were tools that could do this kind of thing. Like I thought that, you know, kind of single line autocomplete was the state-of-the-art before I joined and then that's the moment where I got AGI and yeah, that's where uh code came from. So yeah, it was uh ADER inspired Clyde which inspired quad code. Um so very much big fan of ADR. It's it's an awesome product. I think like people are interested in compare and contrasting obviously because to you obviously this is the house tool you work on it people are interested in like figuring out how to choose between tools there's the cursors of the world there's like Devons of the world there's aers and there's cloud code and you we can't try everything all at once my question would be where do you place it in the universe of options um well you can ask quad to destry all these tools and I wonder what it would no self favoring at all quad play quad plays engineer I don't know we like we use all these tools in house too like we're big fans of all all this stuff like cloud code is uh obviously it's it's a little different than some of these other tools in that it's a lot more raw um like I said there isn't this kind of big beautiful UI on top of it it's raw access to the model it's as raw as it gets so if you want to use a power tool that lets you access the model directly and use claude for um automating you know, big workloads, you know, for example, if you have like a thousand lint violations and you want to start a thousand instances of quad and have it fix each one and make then make a PR, then cloud code is a pretty good tool. Got it. It's it's a tool for power workloads for power users. Um, and I think that's kind of where it fits. Yeah, it's the idea of like parallel versus kind of like single path one way to think about it where like the IDE is really focused on like what you want to do versus like clock code. you kind of more see it as like less supervision required. You can kind of spin up a lot of them. Is that the the right mental model? Yeah. And there's some people at Anthropic that have been racking up like thousands of dollars a day with this kind of automation. Most people don't do anything like that, but you totally could do something like that. Yeah. We we think of it as like a Unix utility, right? So it's like the same way that you would compose, you know, GP or cat or um oh cat or something like that. the same way you can compose code um into workflows. The cost thing is interesting. Do people pay internally or do you get free? If you work at Andra, you can just run this thing as much as you want every day. Um it's for it's for free internally. Nice. Yeah. I I think if everybody had it for free, it would be huge. Um what because like I mean if I think about I pay cursor 20 bucks a month. I use millions and millions of token in cursor that would cost me a lot more in cloud code. And so I think like a lot of people that I've talked to, they don't actually understand how much it costs to do these things. And they'll do a task and they're like, "Oh, that cost 20 cents. I can't believe I paid that much." How do you think going back to like the product side too? It's like how much do you think of that being your responsibility to try and make it more efficient versus that's not really what we're trying to do with the tool? We really see quad code as like the tool that gives you the smartest abilities out of the model. Um, we do care about cost in so far as it's very correlated with latency and we want to make sure that this tool is extremely snappy to use and extremely thorough in its work. We want to be very intentional about all the tokens that it produces. I think we can do more to like communicate the costs with users. Um, currently we're seeing costs around like $6 per day per active user and so it's like it does come out to a bit higher um over the course of a month in cursor. Um, but I don't think it's like out of band and that's like roughly how we're thinking about it. I would add that I think the way I think about it is it's a ROI question. It's not a cost question. And so if you think about you know an average engineer salary and like what you know we were talking about this before before the podcast like engineers are very expensive and if you can make an engineer 50 70% more productive that's worth a lot and I think that's the way to think about it. So if you're saying if you're targeting cloud to be the most powerful end of the spectrum as opposed to the less powerful but faster cheaper side of the spectrum then there there's typically people recommend a waterfall right you try this faster simple one that doesn't work you upgrade you upgrade you upgrade and finally you hit clock code at least for people who are token constrainted that don't work at topic and part of me wants to just fasttrack all that I just want to fan out to everything all at once and once I once I'm not satisfied with the one solution, I would just sort of switch to the next. I I don't know if that's real. Yeah, we're we're definitely trying to make it a little easier to make quad code kind of the tool that you use for all the different workloads. So, for example, we launched uh thinking recently. So, for any kind of planning workload where you might have used other tools before, you can just ask quad and that'll use, you know, chain of thought to to think stuff out. I think we'll get there. Maybe we'll do it this way. How about we recap like sort of the brief history of cloud code like between when you launch and now there there have been quite a few ships. Um how would you highlight the major ones and then we'll get to the the thinking tool? And I think I'd have to like check your Twitter to to remember everything. Um I think a big one that we've gotten a lot of requests for is web fetch. Yep. So we worked really closely with our legal team to make sure that we shipped as secure of an implementation as possible. So um we'll web fetch if a user directly provides an URL whether that's in their call.md or um in their message directly or if a URL is mentioned in one of the previously fetched URLs. And so this way enterprises can feel pretty secure about letting their developers continue to use it. We shipped a bunch of like auto features like autocomplete where you can press tab to complete a file name or file path. Auto compact so that users feel like they have like infinite context since we'll compact behind the scenes. And we also shift auto accept because we noticed that a lot of users were like hey like cloud code can figure it out. I've like developed a lot of trust for cloud code. I wanted to just like autonomously edit my files, run tests, and then come back to me later. So, those are some of the big ones. Uh, Vim mode, custom/comands. People love Vim mode. So, that was a that was a top request, too. That one went pretty viral. Yeah. Yeah. Uh, memory, that was a recent one. So, like the hashtag to remember. So, yeah. I mean, uh, I'd love to dive into, you know, on the technical side, any of them that was particularly challenging. Um a Paul from Ader always says how much of it was coded by Ader you know so then the question is how much of it was coded by cloud code obviously there's some percentage but I wonder if you have a number like 50 80 pretty high probably near 80 I'd say that's very high a lot of human code review though yeah a lot of lot of human code review I think some of the stuff has to be handwritten and some of the code can be written by quad and there's sort of a wisdom in knowing which one to pick and what percent for each kind of task. So, usually where we start is Claude writes the code and then if it's not good, then maybe a human will dive in. There's also some stuff where like I actually prefer to do it by hand. So, it's like, you know, intricate data model refactoring or something. I won't leave it to quad cuz I have really strong opinions and it's easier to just do it and experiment than it is to explain it to Quad. So, yeah, I think that nets out to maybe like 80 90% quadridden code overall. Yeah, we're hearing a lot of that in our portfolio companies like more like series A companies is like 80 85% of the code they write is that generated. Yeah. Yeah. So yeah the well that's a whole different discussion. The custom slash command I had a question. How do you think about custom/comand MCPS like how does this all tie together? You know is the slash command and clock code kind of like an extension of the MCP? Are people building things that should not be MCP but are just kind of like self-contained things in there? How should people think about it? Yeah, I mean obviously we're big fans of MCP. You can use MCP to do a lot of different things. You can use it for custom tools and custom commands and all this stuff, but at the same time you shouldn't have to use it. Um, so if you just want something really simple and local, you just want, you know, some essentially like prompt that's been saved, just use local commands for that. Over time, something that we've been thinking a lot about is how to reexpose things in convenient ways. So, for example, let's say you had this local command. Could you reexpose that as an MCP prompt? Yeah, because cloud code is an MCP client and an MCP server or some let's say you pass in a custom uh you know like a custom bash tool. Is there a way to reexpose that as an MCP tool? Yeah, we think generally you shouldn't have to be tied to a particular technology. You should use whatever works for you. Yeah, because there's some like puppeteer. I think that's like a great way great thing to use with clock code, right, for testing. There's like a puppeteer MCP protocol, but then people can also write their own slash commands. And I'm curious like where MCP are going to end up being, where it's like maybe each slash command leverages MCPS, but no command itself is an MCP because it ends up being customized. I think that's what people are still trying to figure out. It's like should this be in the runtime or in the MCP server? I think people haven't quite figured out where the line is. Yeah, for something like Puppeteer, I think that probably belongs in MCP because there there's a few like tool calls that go in that too. And so it's probably nice to encapsulate that in the MCP server. Whereas slash commands are actually just like prompts, so they're not actually tools. We're thinking about how to expose more customizability options so that people can bring their own tools or turn off some of the tools that uh cloud code comes with. But there's also some trickiness there because um we want to just make sure that the tools people bring are things that COD is able to understand and that people don't accidentally um inhibit their experience by maybe bringing a tool that is like confusing to Claude. So we're just trying to work through the UX of it. Yeah, I'll give an example also of how this stuff connects for quad code. Internally in the GitHub repo, we have this GitHub action that runs. And the GitHub action invokes claude code with a local uh slash command. And the slash command is lint. So it just runs a llinter using cloud. And it's a bunch of things that are pretty tricky to do with a traditional llinter that's based on static analysis. So for example, it'll check for spelling mistakes, but also checks that code matches comments. It also checks that, you know, we use a particular library for network fetches instead of the built-in library. There's a bunch of these specific things that we check that are pretty difficult to express just with lint. And in theory, you can go in and, you know, write a bunch of lint rules for this. Some of it you could cover, some of it you probably couldn't. But honestly, it's much easier to just write a one bullet in markdown in a local command and just commit that. And so what we do is quad runs through the GitHub action. We invoke it with /ro colon lint. So which just invokes that local command. It'll run the llinter. It'll identify any mistakes. It'll make the code changes and then it'll use the GitHub MCP server in order to commit the changes back to the PR. And so you can kind of compose these tools together. And I think that's a lot of the way we think about code is just one tool in an ecosystem that composes nicely without being opinionated about any particular piece. It's interesting. I I I have a weird chapter in my CV uh that makes me I was the CLI maintainer for Nellifi and so I have a little bit of a dive there's a decompilation of cloud code out there that see that has seems has been since been taken down uh but it seems like you use commanderjs and react inc is like the public info about this and I'm just kind of curious like at some point you're just you're not even building cloud code you're kind of just building a general purpose CLI framework that anyone any developer can hack to their purposes. You ever think about this like this level of configurability is more of like a CLI framework or like some new form factor that is doesn't exist before. Yeah, it's definitely been fun to hack on a on a really awesome CLI cuz there's not that many of them. But yeah, we we're big fans of Ink. Um Vadim Dez, we actually used him used React Inc. for a lot of our projects. Oh, cool. Yeah. Yeah. Yeah. Um yeah, Inc. is amazing. It's like it's sort of hacky and janky in a lot of ways. It's like you have you have React and then you're the renderer is just translating the React code to like anti escape codes as the way to render. And there's all sorts of stuff that just doesn't work at all because ANC escape codes are like, you know, it's like this thing that started to be written like the 1970s and there's no really great spec about it. Every terminal is a little different. So building in this way, it feels to me a little bit like uh building for the browser back in the day where you had to think about like Internet Explorer 6 versus Oprah versus like Firefox and whatever. Like you have to think about these cross terminal differences a lot. But yeah, big fans of Ink because it helps abstract over that. We're also uh we use bun. Um so big fans of bun. Um that's been it makes writing our tests and running tests much faster. We don't use it in the runtime yet. It's not just for speed, but you tell me. Yeah. I don't want to I don't want to put words in your mouth, but my impression is they help you ship the compilation, the executable. Yeah, exactly. So, we use bun to to compile the code together. Yeah. Any other pluses of bun? I just want to track bun versus deno conversations. Yeah, because deno's in there, you know. Um I actually haven't used deno back uh it's it's been a while. Um I remember a lot of people say yeah. Yeah, Ryan made it back in the day and it was like there there were some ideas that I think were very cool in it, but yeah, it just never took off to that same degree. Yeah, still a lot of cool ideas like um being able to npm just import from any URL I think is that's the dreaming dream of ESM. Yeah, very cool. Okay. Um I was going to ask you uh one one other feature then we can get to the thinking tool of auto accept. I have this little thing I'm trying to develop thinking around for trust in agents, right? when do you say all right go autonomous when do you pull the pull the developer in and sometimes you let the model decide sometimes you're like this is a distractive action always ask me and I'm just curious if you have any internal heristics around when to auto accept and where all this is going we're spending a lot of time building out the permission system so Robert on our team is leading out this work um we think it's really important to give developers the control to say hey these are like the allowed permissions. Generally, this includes stuff like the model is always allowed to read files or read anything. And then it's up to the user to say, hey, is about to edit files, is to run tests. These are like probably the safest three actions. And then there's like a long list of other actions that um users can either allow list or deny list based on uh reg x matches with the action. Hagen writing a file ever be unsafe if you have version control. I think that's yeah I think it's I think there's like a few different probably like aspects of safety to think about. So it could be useful just to break that out a little bit. So for file editing it's actually less I think about safety although there there is still a safety risk because what might happen is let's say the model fetches a URL and then there's a prompt injection attack in the URL and then the model writes a malicious code to disk and you don't realize it although you know there is code review as like a separate kind of layer there as as protection but I think generally for file rights it the model might just do the wrong thing that's the biggest thing and what we find is that if the model is doing something wrong it's better to identify that earlier and correct it earlier and then you're going to have a better time. If you wait for the model to just go down this like totally wrong path and then correct it 10 minutes later, you're going to have a bad time. So, it's better to usually identify failures away. But at the same time, there's some cases where you just want to let the model go. So, for example, if claude code is uh you know, it's writing tests for me, I'll just hit shift tab, enter auto accept mode, and just let it run the tests and iterate on the tests until they pass. um because I know that's a pretty safe thing to do. And then for some other tools like bash tool, it's pretty different um because quad could ram run, you know, rm rf slash and that would suck, right? That's not a good thing. So we definitely want people to be in the loop to to catch stuff like that. The model is you know trained and aligned to not do that but you know these are non-deterministic systems. So like you still want a human in the loop. Yeah, I think that generally the way that things are trending is um kind of less time between human input. Did you see the meter paper? No. The they establish a Moor's law for time between human input basically and it's basically doubling every 3 to 7 months is the idea. And Enthropic is currently doing super well on that benchmark and it's roughly above autonomous for 50 minutes at the 50th percentile of human effort. Uh which is kind of cool. Highly recommend that. Yeah, I put cursor in yolo mode all the time and just run it. But but it's vibe coding, right? Like this is all of spade. And there's a couple things that are interesting when you talked about alignment and the model being trained. So I always put in a docker container and I have it prefix every command with like the docker compos. And yesterday uh my docker server was not started and I was like oh docker is not running let me just run it outside of docker and I'm like whoa whoa whoa whoa you should start docker and run it in docker you cannot go outside. So That is like a very good example of like you know sometimes you think it's doing something and then it's doing something else. And for the review side it's so I would love to just chat about that more. I think the llinter part that you mentioned I think maybe people skipped it over. It doesn't register the first time but like going from like rulebased linting to like semantic linting I think is like great and super important. And I think a lot of companies are trying to do how do you do autonomous PR review which I've not seen one that I use so far. they're all kind of like mid. So, I'm curious how you think about closing the loop or making that better and figuring out especially like what are you supposed to review because these PRs get pretty big when you buy code. You know, sometimes I'm like, "Oh, wow." Oh, GTM. You know, it's like am I really supposed to read all of this? It kind of seems most of it seems pretty standard, but like I'm sure there are parts in there that the model would understand that are like kind of out of distribution, so to speak, to really look at. So yeah, I know it's a very open-ended question, but any thoughts you have would be great. Yeah, we we have some experiments where Quad is doing code review internally. We're not super happy with the results yet. So it's not something that we want to open up quite yet. The way we're thinking about it is Quad Code is, like I said before, it's a primitive. So if you want to use it to build a code review tool, you can do this. If you want to, you know, build like a security scanning vulnerability scanning tool, you can do that. If you want to build a semantic llinter, you can do that. And hopefully with code it makes it so if you want to do this it's just a few lines of code and you can just have quad write that code also because quad is really great at writing GitHub actions. Yeah. One thing to mention is we do have a non-interactive mode which is like what um cloud uses in these sit or how we use cloud in these situations to automate cloud code and also a lot of our uh the companies using cloud code actually use this non-interactive mode. So they'll for example say hey I have like hundreds of thousands of tests in my repo. Some of them are out ofd some of them are flaky and they'll send cloud code to look at each of these tests and decide okay how can I update any of them? Like should I deprecate some of them? How do I like increase our code coverage? So that's been a really cool way that people are non-interactively using cloud code. What are the best practices here? because when it's non-interactive, it could run forever and you're you're not you're not necessarily reviewing your output of everything, right? So, I'm just kind of curious how does how is it different in non non-interactive mode? What are like the most important hyperparameters or arguments to set? Yeah. And for folks that haven't used those, so non-interactive mode is just quad-p and then you pass in the prompt in quotes and that's all it is. It's just the -p flag. Generally, it's best for tasks that are read only. that's the place where it works really well and you don't you know super have to think about permissions and running forever and and things like that. Um so for example llinter that runs and doesn't fix any issues or for example we're working on a thing where we use quad in with -p to generate the change log for quad. So every PR is just looking over the commit history and being like okay this makes it into the change log this doesn't um because we know people have been requesting uh change log so we're just getting quad to build it. So generate non-interactive mode really good for readonly tasks for tasks where you want to write the thing we usually recommend is pass in a very specific set of permissions on the command line. So what you can do is pass in uh d-allowed tools and then you can allow a specific tool. So for example not just bash but for example get status um or get diff. So you just give it a set of tools that it can use or you know edit tool. Uh it still has default tools are file read GP system tools like bash and ls and memory tools right all those are so it still has yeah it still has all these tools but a lot of tools just lets you instead of the permission prompt because you don't have that in the non-interactive mode it's just kind of pre-accepting uses and we'd also definitely recommend that you start small so like test it on one test make sure that it has reasonable behavior iterate on your prompt then scale it up to 10 make sure that it succeeds or if it fails just like analyze what the patterns of failures are and gradually scale up from there. So definitely don't kick off a run to fix like 100,000 tests. Yeah, I think the so at this point I just you know I I want to this tagline is in my head that basically at anthropic there's cloud code generating code and then cloud code also reviewing its own code that like at some point right like different people are setting all this up you don't really govern that u but it's happening at some yeah we have to be you know at anthropic there's still a human in the loop we're reviewing and I think for you know for ASL this is important so like for general like model alignment and safety what's what's ASL oh so ASL this It's like the kind of the safety levels. Yeah. Right. Right. For what does it stand for? Autonomous safety level. Autonomous. It's essentially like it's like a Sorry, I don't I'm not used to the acronyms. Yeah, we have a lot of these. You But you've published stuff like I know. I just don't know what they're called internally. Yeah, exactly. But it's essentially like as the model gets more capable and it hits, you know, ASL5 is kind of the highest level. It's like you know the model is capable of fooling a user if it wants to and kind of exfiltrate itself like inject itself out of its container and replicate itself across other containers. We're not elizer yukoski ah like yeah this is like where the line goes vertical. We're at two. We're at two right now. Yeah we're kind of bordering on three right now. Yeah. So I think at like three, four and five you you start having to think a lot more carefully about this because hopefully the model is aligned but in case it's not aligned you need a human in the loop in in the right ways. The point of the thing I was thinking about was we have you know VPs of engine CTO's listening like it's this is all well and good for the individual developer but the people who are responsible for the tech the entire code base the engineering decisions all this is going on my developers like I I manage like a 100 developers any of them could be doing any of this at this point what do I do to manage this how does my code review process change how does my change management change I don't know we've talked to a lot of VPs and CTO's about it, they actually tend to be quite excited because they experiment with the tool, they download it, they ask it a few questions and like cloud code when it gives them sensible answers, they're really excited because they're like, "Oh, I can understand like this nuance in the codebase and sometimes they even ship small features with quad code and I think through that process of like interacting with the tool um they build a lot of trust in it and a lot of folks actually come to us and they ask us like how how can I roll out more broadly. Um, and then we'll often like have sessions with like VPs of dev prod and talk about these concerns around how do we make sure people are writing high quality code. I think in general it's still very much up to the individual developer to hold themselves up to a very high standard for the quality of code that they merge. Even if we use quad code to write a lot of our code, it's still up to the individual who merges it to be responsible for like this being well-maintained, well doumented code that has like reasonable abstractions. And so I I think that's something that will continue to happen where quad code isn't its own engineer that's like committing code by itself. It's still very much up to the IC's to be responsible for the code that's produced. Yeah, I think cloud code also makes a lot of this stuff a lot of quality work becomes a lot easier. So for example like I have not manually written a unit test in many months and we have a lot of unit test. We have a lot of unit tests and it's because quad writes all the tests and you know before I felt like a jerk if on someone's PR I'm like hey can you write a test cuz you know they kind of know they coverage is that coverage. Yeah. Okay. And you know, they kind of know they should probably write a test and that's probably the right thing to do. And somewhere in their head they made that trade-off where they just want to ship faster. And so you always kind of feel like a jerk for asking, but now I always ask because Quad can just write the test, right? And you know there's no human work. You just ask Quad to do it and it it writes it. And I think with writing tests becoming easier and with writing lint rules becoming easier, it's actually much easier to have high quality code than than it was before. What are the metrics that you believe in? like is it a lot of people actually don't believe in 100% code coverage because sometimes that is kind of optimizing for the wrong thing arguably I don't know uh but like obviously you have a lot of experience in different code quality metrics but what what what is still what still makes sense I think it's very engineering team dependent honestly I wish there was a one size fits all answer yeah like for me the one solution for uh for some teams test coverage is extremely important um for other teams type coverage is very important Especially if you're working in, you know, a very strictly typed language. And, you know, for example, avoiding like NES and JavaScript and Python. Y I think complexity kind of gets a lot of flack, but it's still honestly a pretty good metric just cuz there isn't anything better in terms of ways to measure code quality. Okay. And then productivity, obviously not lines of code, but do you care about measuring productivity? I'm sure you do. Yeah. You know, lines of code honestly isn't terrible. Oh god, it's uh it has downsides. Yeah, it's it's terri Well, it lines of code is terrible for a lot of reasons. Yes. But it's really hard to make anything better. So, it's the least terrible. It's the least terrible. There's like lines of code maybe like number of PRs, how green your GitHub is. Yeah. Yeah. Yeah. The two that we're really trying to nail down are one decrease in cycle time. So, how much faster are your features shipping because you're using these tools. So that might be something like the time between first commit and when your PR is merged. It's very tricky to get right, but one of the ones that we're targeting. The other one that we want to measure more rigorously is like the number of features that you wouldn't have otherwise built. We have a lot of channels where we get customer feedback and one of the patterns that we've seen with cloud code is that sometimes customer support or customer success will like post hey like um this app has like this bug and then sometimes 10 minutes later one of the engineers on that team will be like cloud code made a fix for it. And a lot of those situations when you like ping them and you're like, "Hey, that was really cool." They were like, "Yeah, um, without cloud code, I probably wouldn't have done that because it would have been too much of a divergence from what I was otherwise going to do. It would have just ended up in this long backlog." So, this is the kind of stuff that we really want to measure more rigorously. That was the other AGI pilled moment for me. There was a really early version of quad code many, many months ago. And this one engineer at Enthropic Jeremy built a bot that looked through a particular feedback channel on Slack and he hooked it up to code to have code automatically put up PRs with just fixes to all the stuff and some of the stuff you know it couldn't fix every issue but it fixed a lot of the issues and I was like 10% 50 you know this was like early on so I don't remember the number but it was it was surprisingly high to the point where I became a believer I see in this kind of workflow and I I wasn't before. SOPM isn't that scary too in a way where you can build too many things. It's almost like maybe you shouldn't build that many things. I think that's what I'm struggling with the most. It's like it gives you the ability to create create create but then at some point you got to support support. This is the Jurassic Park like your scientist is so preoccupied with whether you could. Yeah. Yeah. Exactly. But no, we should uh Yeah. How do you make decisions like now that the cost of actually implementing the thing is going down as a PM? How do you decide what is actually worth doing? Yeah, we definitely still hold a very high bar for net new features. Most of the fixes were like, hey, this functionality is broken or this like there's a weird edge case that we hadn't addressed yet. So, it was very much like smoothing out the rough edges as opposed to building something completely net new. For net new features, I think we hold a pretty high bar that it's very intuitive to use. The new user experience is like minimal. It's just like obvious that it works. We sometimes actually use cloud code to prototype instead of using docs. Yeah. So you'll have like prototypes that you can play around with and that often gives us a faster feel for hey is this feature ready yet or like is this the right abstraction? Is this the right interaction pattern? So it gets us faster to feeling really confident about a feature, but it's it doesn't circumvent the process of us making sure that the feature definitely fits in like the product vision. It's interesting how as it gets easier to build stuff, it changes the way that I write software where like like K saying like before I would write a big design doc and I would think about a problem for a long time before I would build it sometimes for some set of problems and now I'll just ask quad code to prototype like three versions of it and I'll try the feature and see which one I like better and then that informs me much better and much faster than a doc would have. Yeah, I think we haven't totally internalized that transition yet in the industry. Yeah, I feel the same the same way for some tools I build internally. People ask me, could we do this? And I'm like, I'll just Yeah, just build it. It's like, well, I feel it feels pretty good. We should like polish it, you know, or sometimes it's like, no, that's not. It's comforting that, you know, like that your up your max cost is I mean your even at anthropic where it's theoretically unlimited, the cost is roughly $6 a day. That gives people peace of mind because I'm like, $6 a day, fine. $600 a day, we have to talk, like, you know. Yeah. I pay 200 bucks a month to make Studio Gibble photos. So, it's all it's all good. That is totally worth it. You mentioned internal tools and that's actually a really big use case that we're seeing emerge because a lot of times um if you're working on something operationally intensive, if you can spin up a internal dashboard for it or like an operational tool where you can for example grant access to a thousand emails at once, a lot of these things you don't really need to have like a super polished design. You kind of just need something that works. And Quad Code's really good at those kinds of 0ero to one tasks. Like we use Streamlit internally and there's been like a proliferation of how much we're able to visualize and because we're able to visualize it, we're able to see patterns that we wouldn't have otherwise if we were just looking at like raw data. Yeah. Like I I was working on also this like side website uh last week and I just showed Cloud Code the mock. So I just took the you know the screenshot I had dragged and dropped it into the terminal and I was like hey quad here's the mock can you implement it and it implemented it like you know it sort of worked. It was a little bit crummy and I was like all right now look at it in puppeteer and like iterate on it until it looks like the mock and then it did that three or four times and then the thing looked like the mock. Yeah, this was just all manual work before. I think we're going to ask about like two other features of I guess the the overall agent uh pieces that we mentioned. So I'm interested in memory as well. So we talked about autoco compact and memory using hashtags and stuff. My impression is that your you like you say simplest approach works but I'm curious if you've seen any other requests that are interesting to you or internal hacks of memory that people have explored that like you know you might want to surface to others. There's a bunch of different approaches to memory. Most of them use external stores of various sorts. Uh there's chroma. Yeah, exactly. Yeah, there there's a lot of projects like that and uh yeah, it's it's either way K value or kind of like graphs that's like the two big shapes for these. Um you believer in knowledge graphs for this stuff or you know I'm a big I if you talked to me before I joined enthropic and this team I would have said yeah definitely um but now actually I feel everything is the model like that's the thing that wins in the end and it just as the model gets better it's it subsumes everything else so you know at some point the model will encode its own knowledge graph it'll encode its own like KV store if you just give it the right tools. Yeah, but yeah, I think the the specific tools there's still a lot of room for experimentation that we just we don't we don't know yet. In some ways, are we just coping for lack of context length? Like are we doing things for memory now that if we had like a 100 million token
Original Description
More info: https://docs.anthropic.com/en/docs/claude-code/overview
The AI coding wars have now split across four battlegrounds:
1. AI IDEs: with two leading startups in Windsurf ($3B acq. by OpenAI) and Cursor ($9B valuation) and a sea of competition behind them (like Cline, Github Copilot, etc).
2. Vibe coding platforms: Bolt.new, Lovable, v0, etc. all experiencing fast growth and getting to the tens of millions of revenue in months.
3. The teammate agents: Devin, Cosine, etc. Simply give them a task, and they will get back to you with a full PR (with mixed results)
4. The cli-based agents: after Aider’s initial success, we are now seeing many other alternatives including two from the main labs: OpenAI Codex and Claude Code. The main draw is that 1) they are composable 2) they are pay as you go based on tokens used.
Since we covered all three of the first categories, today’s guests are Boris and Cat, the lead engineer and PM for Claude Code. If you only take one thing away from this episode, it’s this piece from Boris: Claude Code is not a product as much as it’s a Unix utility.
This fits very well with Anthropic’s product principle: “do the simple thing first.” Whether it’s the memory implementation (a markdown file that gets auto-loaded) or the approach to prompt summarization (just ask Claude to summarize), they always pick the smallest building blocks that are useful, understandable, and extensible. Even major features like planning (“/think”) and memory (#tags in markdown) fit the same idea of having text I/O as the core interface. This is very similar to the original UNIX design philosophy:
Claude Code is also the most direct way to consume Sonnet for coding, rather than going through all the hidden prompting and optimization than the other products do. You will feel that right away, as the average spend per user is $6/day on Claude Code compared to $20/mo for Cursor, for example. Apparently, there are some engineers inside of Anthropic that have spen
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Latent Space · Latent Space · 0 of 60
← Previous
Next →
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Ep 18: Petaflops to the People — with George Hotz of tinycorp
Latent Space
FlashAttention-2: Making Transformers 800% faster AND exact
Latent Space
RWKV: Reinventing RNNs for the Transformer Era
Latent Space
Generating your AI Media Empire - with Youssef Rizk of Wondercraft.ai
Latent Space
RAG is a hack - with Jerry Liu of LlamaIndex
Latent Space
The End of Finetuning — with Jeremy Howard of Fast.ai
Latent Space
Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue
Latent Space
Powering your Copilot for Data - with Artem Keydunov from Cube.dev
Latent Space
Beating GPT-4 with Open Source Models - with Michael Royzen of Phind
Latent Space
The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis
Latent Space
The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of SourceGraph
Latent Space
The AI-First Graphics Editor - with Suhail Doshi of Playground AI
Latent Space
The Accidental AI Canvas - with Steve Ruiz of tldraw
Latent Space
The Origin and Future of RLHF: the secret ingredient for ChatGPT - with Nathan Lambert
Latent Space
The Four Wars of the AI Stack - Dec 2023 Recap
Latent Space
The State of AI in production — with David Hsu of Retool
Latent Space
Building an open AI company - with Ce and Vipul of Together AI
Latent Space
Truly Serverless Infra for AI Engineers - with Erik Bernhardsson of Modal
Latent Space
A Brief History of the Open Source AI Hacker - with Ben Firshman of Replicate
Latent Space
Open Source AI is AI we can Trust — with Soumith Chintala of Meta AI
Latent Space
Making Transformers Sing - with Mikey Shulman of Suno
Latent Space
A Comprehensive Overview of Large Language Models - Latent Space Paper Club
Latent Space
Why Google failed to make GPT-3 -- with David Luan of Adept
Latent Space
Personal AI Meetup - Bee, BasedHardware, LangChain LangFriend, Deepgram EmilyAI
Latent Space
Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit
Latent Space
Breaking down the OG GPT Paper by Alec Radford
Latent Space
High Agency Pydantic over VC Backed Frameworks — with Jason Liu of Instructor
Latent Space
This World Does Not Exist — Joscha Bach, Karan Malhotra, Rob Haisfield (WorldSim, WebSim, Liquid AI)
Latent Space
LLM Asia Paper Club Survey Round
Latent Space
How to train a Million Context LLM — with Mark Huang of Gradient.ai
Latent Space
How AI is Eating Finance - with Mike Conover of Brightwave
Latent Space
How To Hire AI Engineers (ft. James Brady and Adam Wiggins of Elicit)
Latent Space
State of the Art: Training 70B LLMs on 10,000 H100 clusters
Latent Space
The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Latent Space
Training Llama 2, 3 & 4: The Path to Open Source AGI — with Thomas Scialom of Meta AI
Latent Space
[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models
Latent Space
Synthetic data + tool use for LLM improvements 🦙
Latent Space
RLHF vs SFT to break out of local maxima 📈
Latent Space
The Winds of AI Winter (Q2 Four Wars of the AI Stack Recap)
Latent Space
Segment Anything 2: Memory + Vision = Object Permanence — with Nikhila Ravi and Joseph Nelson
Latent Space
Answer.ai & AI Magic with Jeremy Howard
Latent Space
Is finetuning GPT4o worth it?
Latent Space
Personal benchmarks vs HumanEval - with Nicholas Carlini of DeepMind
Latent Space
Building AGI with OpenAI's Structured Outputs API
Latent Space
Q* for model distillation 🍓
Latent Space
Finetuning LoRAs on BILLIONS of tokens 🤖
Latent Space
Cursor UX team is CRACKED 💻
Latent Space
Choosing the BEST OpenAI model 🏆
Latent Space
How will OpenAI voice mode change API design?
Latent Space
STEALING OpenAI models data 🥷
Latent Space
[Paper Club] 🍓 On Reasoning: Q-STaR and Friends!
Latent Space
[Paper Club] Writing in the Margins: Chunked Prefill KV Caching for Long Context Retrieval
Latent Space
The Ultimate Guide to Prompting - with Sander Schulhoff from LearnPrompting.org
Latent Space
llm.c's Origin and the Future of LLM Compilers - Andrej Karpathy at CUDA MODE
Latent Space
Prompt Engineer is NOT a job 📝
Latent Space
Prompt Mining LLMs for better prompts ⛏️
Latent Space
The six pillars of few-shot prompting 🔧
Latent Space
Language Agents: From Reasoning to Acting — with Shunyu Yao of OpenAI, Harrison Chase of LangGraph
Latent Space
[Paper Club] Who Validates the Validators? Aligning LLM-Judges with Humans (w/ Eugene Yan)
Latent Space
Can you separate intelligence and knowledge?
Latent Space
More on: Agentic Coding
View skill →Related Reads
📰
📰
📰
📰
GitHub Copilot Is Rewriting How You Think About Database Design — And Not in a Good Way
Dev.to AI
I Used Cursor, Claude Code, and GitHub Copilot on the Same Laravel Feature.
Medium · Programming
Switching from Claude Code to Grok – Same Interface, Different Model
Dev.to · Dragos Roua
AI Text Enhancer – Full Technical Implementation Guide to Deblur Text in Images
Dev.to · Daniyal khan
🎓
Tutor Explanation
DeepCamp AI