How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR

AI Engineer · Intermediate ·💻 AI-Assisted Coding ·3mo ago


AI models are crushing benchmarks. SWE-bench scores are climbing, and METR's measured time horizons are rising rapidly. Yet when we deployed these same models in a field study with experienced developers, they didn't speed up work. What's going on? Are benchmarks misleading us about AI capabilities? Are we missing something about how AI performs in the real world?

In this talk, we'll reconcile lab and field evidence on AI capabilities. Drawing from METR's time horizon measurements and developer productivity RCT, we'll explore why impressive benchmark performance doesn't always translate to real-world impact. We'll examine potential explanations, from reliability requirements to task distribution to capability elicitation, and discuss what this means for automated AI R&D.

https://x.com/joel_bkr

Timestamps

00:00 The Compute-Time Horizon Argument
01:43 Potential Constraints on AI Scaling (Power & Dollars)
04:23 The Problem of Eclipsing Evaluation Time
06:52 Meta's "J-Curve" of Developer Productivity
09:12 Unreliability of Self-Reported Time Estimates
11:43 Personal Experiences with AI Tools (Cursor) & Learning Curves
14:10 METR Study Deep Dive: Scatter Plots & Variance
16:48 The Controversy of "Conservative" Usage Estimates
21:41 Unpublished Hackathon Results (AI Allowed vs. Disallowed)
25:28 Why AI Struggles with Data Science & Messy Enterprise Data
30:35 Example of AI Failure on Complex Deployment Metrics
38:29 Quantifying Speed-Up: The Methodological Challenges
46:30 Future Metrics: "Watched" vs. "Unwatched" Time Horizons
52:52 Moving Beyond Benchmarks: "In the Wild" Transcripts
56:12 The "Agent Village" & Fuzzy Goal Measurement
58:53 The "Neurodivergent AI" Hypothesis & Interface Mismatch
01:06:31 Software-Only Singularity vs. Hardware Constraints
01:13:53 AI Applications in Chip Fabrication & Yield Improvement
