Fuzzing in the GenAI Era — Leonard Tang, Haize Labs

AI Engineer · Beginner ·🛡️ AI Safety & Ethics ·8mo ago
"Evaluation" is one of those concepts that every AI practitioner vaguely knows is important, but few practitioners truly understand. Is "eval" the dataset for measuring the quality of your AI system? Is "eval" the measure, the metric of quality? Is "eval" the process of human annotation and scoring? Or is "eval" a third-party dataset run once to benchmark a model? To mitigate this cacophony, this talk will provide an opinionated and principled perspective for what we actually mean when we say “evaluation”, beyond the traditional for-loop-over-a-static dataset. In particular, this perspective draws heavy inspiration from *fuzzing*, i.e. bombarding AI with simulated, unexpected user inputs to uncover corner cases at scale. This factors into sub-problems regarding: - Quality Metric. What is the actual criteria we, as humans, are using to determine if an AI system is producing good or bad responses? How do we elicit these criteria before the human SME can articulate them? How do we, as efficiently as possible, operationalize this criteria with an automated *Judge*? - Stimuli Generation. Given a metric, how do we know, with confidence, that an AI system is performing well with respect to the metric? What data is representative and sufficient for discovering all potential bugs of an AI system? And how do we generate this complex, diverse, faithful data at scale? We will discuss in detail the philosophy, technology, and case studies behind both problems of Quality Metric and Stimuli Generation, and how they interact in concert. Timestamps 00:00 Introduction to Haizing 01:16 The "Last Mile Problem" in AI 02:47 The Brittleness of GenAI Applications 03:54 Examples of Brittle Chatbots 04:29 Inadequacy of Standard Evaluation Methods 06:09 Haizing: Simulating the Last Mile 08:43 Scaling Evaluation with Agents as Judges 09:29 Verdict: Accuracy vs. Latency 11:47 Scaling Evaluation with RL-Tuned Judges 14:06 Fuzzing vs. Adversarial Testing in AI 14:37 Simulation as Prompt Opt
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from AI Engineer · AI Engineer · 0 of 60

← Previous Next →
1 AI Engineer Summit 2023 — DAY 1 Livestream
AI Engineer Summit 2023 — DAY 1 Livestream
AI Engineer
2 AI Engineer Summit 2023 — DAY 2 Livestream
AI Engineer Summit 2023 — DAY 2 Livestream
AI Engineer
3 Principles for Prompt Engineering - Karina Nguyen (Claude Instant @ Anthropic)
Principles for Prompt Engineering - Karina Nguyen (Claude Instant @ Anthropic)
AI Engineer
4 Announcing the AI Engineer Network: Benjamin Dunphy
Announcing the AI Engineer Network: Benjamin Dunphy
AI Engineer
5 The 1,000x AI Engineer: Swyx
The 1,000x AI Engineer: Swyx
AI Engineer
6 Building AI For All: Amjad Masad & Michele Catasta
Building AI For All: Amjad Masad & Michele Catasta
AI Engineer
7 The Age of the Agent: Flo Crivello
The Age of the Agent: Flo Crivello
AI Engineer
8 See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman
See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman
AI Engineer
9 Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase
Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase
AI Engineer
10 Pydantic is all you need: Jason Liu
Pydantic is all you need: Jason Liu
AI Engineer
11 Building Blocks for LLM Systems & Products: Eugene Yan
Building Blocks for LLM Systems & Products: Eugene Yan
AI Engineer
12 The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer
The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer
AI Engineer
13 Climbing the Ladder of Abstraction: Amelia Wattenberger
Climbing the Ladder of Abstraction: Amelia Wattenberger
AI Engineer
14 Supabase Vector: The Postgres Vector database: Paul Copplestone
Supabase Vector: The Postgres Vector database: Paul Copplestone
AI Engineer
15 [Workshop] AI Engineering 101
[Workshop] AI Engineering 101
AI Engineer
16 The Hidden Life of Embeddings: Linus Lee
The Hidden Life of Embeddings: Linus Lee
AI Engineer
17 [Workshop] AI Engineering 201: Inference
[Workshop] AI Engineering 201: Inference
AI Engineer
18 The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex
The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex
AI Engineer
19 The AI Evolution: Mario Rodriguez, GitHub
The AI Evolution: Mario Rodriguez, GitHub
AI Engineer
20 Move Fast Break Nothing: Dedy Kredo
Move Fast Break Nothing: Dedy Kredo
AI Engineer
21 AI Engineering 201: The Rest of the Owl
AI Engineering 201: The Rest of the Owl
AI Engineer
22 Building Reactive AI Apps: Matt Welsh
Building Reactive AI Apps: Matt Welsh
AI Engineer
23 Pragmatic AI with TypeChat: Daniel Rosenwasser
Pragmatic AI with TypeChat: Daniel Rosenwasser
AI Engineer
24 Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan
Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan
AI Engineer
25 Retrieval Augmented Generation in the Wild: Anton Troynikov
Retrieval Augmented Generation in the Wild: Anton Troynikov
AI Engineer
26 Building Production-Ready RAG Applications: Jerry Liu
Building Production-Ready RAG Applications: Jerry Liu
AI Engineer
27 120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson
120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson
AI Engineer
28 The Weekend AI Engineer: Hassan El Mghari
The Weekend AI Engineer: Hassan El Mghari
AI Engineer
29 Harnessing the Power of LLMs Locally: Mithun Hunsur
Harnessing the Power of LLMs Locally: Mithun Hunsur
AI Engineer
30 Trust, but Verify: Shreya Rajpal
Trust, but Verify: Shreya Rajpal
AI Engineer
31 Open Questions for AI Engineering: Simon Willison
Open Questions for AI Engineering: Simon Willison
AI Engineer
32 Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD
Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD
AI Engineer
33 GPT Web App Generator - 10,000 apps created in a month: Matija Sosic
GPT Web App Generator - 10,000 apps created in a month: Matija Sosic
AI Engineer
34 Using AI to Build an Infinite Game: Jeff Schomay
Using AI to Build an Infinite Game: Jeff Schomay
AI Engineer
35 How to Become an AI Engineer from a Fullstack Background - Reid Mayo
How to Become an AI Engineer from a Fullstack Background - Reid Mayo
AI Engineer
36 The Code AI Maturity Model and What It Means For You: Ado Kukic
The Code AI Maturity Model and What It Means For You: Ado Kukic
AI Engineer
37 AI Engineer World’s Fair 2024 - Keynotes & Multimodality track
AI Engineer World’s Fair 2024 - Keynotes & Multimodality track
AI Engineer
38 From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet
From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet
AI Engineer
39 The Making of Devin by Cognition AI: Scott Wu
The Making of Devin by Cognition AI: Scott Wu
AI Engineer
40 The Future of Knowledge Assistants: Jerry Liu
The Future of Knowledge Assistants: Jerry Liu
AI Engineer
41 Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney
Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney
AI Engineer
42 Open Challenges for AI Engineering: Simon Willison
Open Challenges for AI Engineering: Simon Willison
AI Engineer
43 Lessons From A Year Building With LLMs
Lessons From A Year Building With LLMs
AI Engineer
44 From Software Developer to AI Engineer: Antje Barth
From Software Developer to AI Engineer: Antje Barth
AI Engineer
45 Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner
Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner
AI Engineer
46 Copilots Everywhere: Thomas Dohmke and Eugene Yan
Copilots Everywhere: Thomas Dohmke and Eugene Yan
AI Engineer
47 Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han
Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han
AI Engineer
48 Low Level Technicals of LLMs: Daniel Han
Low Level Technicals of LLMs: Daniel Han
AI Engineer
49 Emergence Launch: AI Agents and the future enterprise: Dr. Satya Nitta
Emergence Launch: AI Agents and the future enterprise: Dr. Satya Nitta
AI Engineer
50 How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou
How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou
AI Engineer
51 What's new from Anthropic and what's next: Alex Albert
What's new from Anthropic and what's next: Alex Albert
AI Engineer
52 Using agents to build an agent company: Joao Moura
Using agents to build an agent company: Joao Moura
AI Engineer
53 Decoding the Decoder LLM without de code: Ishan Anand
Decoding the Decoder LLM without de code: Ishan Anand
AI Engineer
54 Running AI Application in Minutes w/ AI Templates: Gabriela de Queiroz, Pamela Fox, Harald Kirschner
Running AI Application in Minutes w/ AI Templates: Gabriela de Queiroz, Pamela Fox, Harald Kirschner
AI Engineer
55 Building with Anthropic Claude: Prompt Workshop with Zack Witten
Building with Anthropic Claude: Prompt Workshop with Zack Witten
AI Engineer
56 Building Reliable Agentic Systems: Eno Reyes
Building Reliable Agentic Systems: Eno Reyes
AI Engineer
57 10x Development: LLMs For the working Programmer - Manuel Odendahl
10x Development: LLMs For the working Programmer - Manuel Odendahl
AI Engineer
58 Disrupting the $15 Trillion Construction Industry with Autonomous Agents: Dr. Sarah Buchner
Disrupting the $15 Trillion Construction Industry with Autonomous Agents: Dr. Sarah Buchner
AI Engineer
59 Hypermode Launch: Kevin Van Gundy
Hypermode Launch: Kevin Van Gundy
AI Engineer
60 Git push get an AI API: Ryan Fox-Tyler
Git push get an AI API: Ryan Fox-Tyler
AI Engineer

Related AI Lessons

Operational continuity is not governability.
Operational continuity and governability are distinct concepts in AI and business, and understanding their differences is crucial for effective management
Medium · Deep Learning
AI gave North Korean hackers a $600 million month. DeFi is still working out how to respond.
AI-powered North Korean hackers stole $600 million from DeFi platforms in one month, highlighting the need for improved security measures
The Next Web AI
The Fallacy of Vibe-Driven Development: A Critical Look at AI Scaling
Learn to critically evaluate AI scaling strategies and avoid the pitfalls of vibe-driven development to ensure effective AI implementation
Dev.to · Aneesha Prasannan
New Jersey’s 2026 AI Push
New Jersey advances AI legislation to combat deepfakes with harsher penalties, including up to 5 years imprisonment and $30,000 fines
Dev.to AI

Chapters (11)

Introduction to Haizing
1:16 The "Last Mile Problem" in AI
2:47 The Brittleness of GenAI Applications
3:54 Examples of Brittle Chatbots
4:29 Inadequacy of Standard Evaluation Methods
6:09 Haizing: Simulating the Last Mile
8:43 Scaling Evaluation with Agents as Judges
9:29 Verdict: Accuracy vs. Latency
11:47 Scaling Evaluation with RL-Tuned Judges
14:06 Fuzzing vs. Adversarial Testing in AI
14:37 Simulation as Prompt Opt
Up next
Don’t Let AI Make You Dumb 🧠 #shorts
Jacky Chou from Indexsy
Watch →