Beyond Leaderboards: LMArena’s Mission to Make AI Reliable
a16z general partner Anjney Midha sits down with LMArena cofounders Anastasios N. Angelopoulos, Wei-Lin Chiang, and Ion Stoica to talk about the future of AI evaluation.
As benchmarks struggle to keep up with the pace of real-world deployment, LMArena is reframing the problem: what if the best way to test AI models is to put them in front of millions of users and let them vote? The team discusses how Arena evolved from a research side project into a key part of the AI stack, why fresh and subjective data is crucial for reliability, and what it means to build a CI/CD pipeline for large models.
They also explore:
- Why expert-only benchmarks are no longer enough
- How user preferences reveal model capabilities — and their limits
- What it takes to build personalized leaderboards and evaluation SDKs
- And why real-time testing is foundational for mission-critical AI
Chapters:
00:00:04 - LLM evaluation: From consumer chatbots to mission-critical systems
00:06:04 - Style and substance: Crowdsourcing expertise
00:18:51 - Building immunity to overfitting and gaming the system
00:29:49 - The roots of LMArena
00:41:29 - Proving the value of academic AI research
00:48:28 - Scaling LMArena and starting a company
00:59:59 - Benchmarks, evaluations, and the value of ranking LLMs
01:12:13 - The challenges of measuring AI reliability
01:17:57 - Expanding beyond binary rankings as models evolve
01:28:07 - A leaderboard for each prompt
01:31:28 - The LMArena roadmap
01:34:29 - The importance of open source and openness
01:43:10 - Adapting to agents (and other AI evolutions)