Fuzzing in the GenAI Era — Leonard Tang, Haize Labs

AI Engineer · Beginner ·🛡️ AI Safety & Ethics ·8mo ago

Skills: AI Security80%RAG Evaluation70%LLM Engineering60%

"Evaluation" is one of those concepts that every AI practitioner vaguely knows is important, but few practitioners truly understand. Is "eval" the dataset for measuring the quality of your AI system? Is "eval" the measure, the metric of quality? Is "eval" the process of human annotation and scoring? Or is "eval" a third-party dataset run once to benchmark a model? To mitigate this cacophony, this talk will provide an opinionated and principled perspective for what we actually mean when we say “evaluation”, beyond the traditional for-loop-over-a-static dataset. In particular, this perspective draws heavy inspiration from *fuzzing*, i.e. bombarding AI with simulated, unexpected user inputs to uncover corner cases at scale. This factors into sub-problems regarding: - Quality Metric. What is the actual criteria we, as humans, are using to determine if an AI system is producing good or bad responses? How do we elicit these criteria before the human SME can articulate them? How do we, as efficiently as possible, operationalize this criteria with an automated *Judge*? - Stimuli Generation. Given a metric, how do we know, with confidence, that an AI system is performing well with respect to the metric? What data is representative and sufficient for discovering all potential bugs of an AI system? And how do we generate this complex, diverse, faithful data at scale? We will discuss in detail the philosophy, technology, and case studies behind both problems of Quality Metric and Stimuli Generation, and how they interact in concert. Timestamps 00:00 Introduction to Haizing 01:16 The "Last Mile Problem" in AI 02:47 The Brittleness of GenAI Applications 03:54 Examples of Brittle Chatbots 04:29 Inadequacy of Standard Evaluation Methods 06:09 Haizing: Simulating the Last Mile 08:43 Scaling Evaluation with Agents as Judges 09:29 Verdict: Accuracy vs. Latency 11:47 Scaling Evaluation with RL-Tuned Judges 14:06 Fuzzing vs. Adversarial Testing in AI 14:37 Simulation as Prompt Opt

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from AI Engineer · AI Engineer · 0 of 60

← Previous Next →

AI Engineer Summit 2023 — DAY 1 Livestream

AI Engineer Summit 2023 — DAY 1 Livestream

AI Engineer Summit 2023 — DAY 2 Livestream

AI Engineer Summit 2023 — DAY 2 Livestream

Principles for Prompt Engineering - Karina Nguyen (Claude Instant @ Anthropic)

Principles for Prompt Engineering - Karina Nguyen (Claude Instant @ Anthropic)

Announcing the AI Engineer Network: Benjamin Dunphy

Announcing the AI Engineer Network: Benjamin Dunphy

The 1,000x AI Engineer: Swyx

The 1,000x AI Engineer: Swyx

Building AI For All: Amjad Masad & Michele Catasta

Building AI For All: Amjad Masad & Michele Catasta

The Age of the Agent: Flo Crivello

The Age of the Agent: Flo Crivello

See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman

See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman

Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase

Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase

Pydantic is all you need: Jason Liu

Pydantic is all you need: Jason Liu

Building Blocks for LLM Systems & Products: Eugene Yan

Building Blocks for LLM Systems & Products: Eugene Yan

The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer

The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer

Climbing the Ladder of Abstraction: Amelia Wattenberger

Climbing the Ladder of Abstraction: Amelia Wattenberger

Supabase Vector: The Postgres Vector database: Paul Copplestone

Supabase Vector: The Postgres Vector database: Paul Copplestone

[Workshop] AI Engineering 101

[Workshop] AI Engineering 101

The Hidden Life of Embeddings: Linus Lee

The Hidden Life of Embeddings: Linus Lee

[Workshop] AI Engineering 201: Inference

[Workshop] AI Engineering 201: Inference

The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex

The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex

The AI Evolution: Mario Rodriguez, GitHub

The AI Evolution: Mario Rodriguez, GitHub

Move Fast Break Nothing: Dedy Kredo

Move Fast Break Nothing: Dedy Kredo

AI Engineering 201: The Rest of the Owl

AI Engineering 201: The Rest of the Owl

Building Reactive AI Apps: Matt Welsh

Building Reactive AI Apps: Matt Welsh

Pragmatic AI with TypeChat: Daniel Rosenwasser

Pragmatic AI with TypeChat: Daniel Rosenwasser

Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan

Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan

Retrieval Augmented Generation in the Wild: Anton Troynikov

Retrieval Augmented Generation in the Wild: Anton Troynikov

Building Production-Ready RAG Applications: Jerry Liu

Building Production-Ready RAG Applications: Jerry Liu

120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson

120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson

The Weekend AI Engineer: Hassan El Mghari

The Weekend AI Engineer: Hassan El Mghari

Harnessing the Power of LLMs Locally: Mithun Hunsur

Harnessing the Power of LLMs Locally: Mithun Hunsur

Trust, but Verify: Shreya Rajpal

Trust, but Verify: Shreya Rajpal

Open Questions for AI Engineering: Simon Willison

Open Questions for AI Engineering: Simon Willison

Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD

Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD

GPT Web App Generator - 10,000 apps created in a month: Matija Sosic

GPT Web App Generator - 10,000 apps created in a month: Matija Sosic

Using AI to Build an Infinite Game: Jeff Schomay

Using AI to Build an Infinite Game: Jeff Schomay

How to Become an AI Engineer from a Fullstack Background - Reid Mayo

How to Become an AI Engineer from a Fullstack Background - Reid Mayo

The Code AI Maturity Model and What It Means For You: Ado Kukic

The Code AI Maturity Model and What It Means For You: Ado Kukic

AI Engineer World’s Fair 2024 - Keynotes & Multimodality track

AI Engineer World’s Fair 2024 - Keynotes & Multimodality track

From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet

From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet

The Making of Devin by Cognition AI: Scott Wu

The Making of Devin by Cognition AI: Scott Wu

The Future of Knowledge Assistants: Jerry Liu

The Future of Knowledge Assistants: Jerry Liu

Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney

Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney

Open Challenges for AI Engineering: Simon Willison

Open Challenges for AI Engineering: Simon Willison

Lessons From A Year Building With LLMs

Lessons From A Year Building With LLMs

From Software Developer to AI Engineer: Antje Barth

From Software Developer to AI Engineer: Antje Barth

Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner

Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner

Copilots Everywhere: Thomas Dohmke and Eugene Yan

Copilots Everywhere: Thomas Dohmke and Eugene Yan

Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han

Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han

Low Level Technicals of LLMs: Daniel Han

Low Level Technicals of LLMs: Daniel Han

Emergence Launch: AI Agents and the future enterprise: Dr. Satya Nitta

Emergence Launch: AI Agents and the future enterprise: Dr. Satya Nitta

How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou

How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou

What's new from Anthropic and what's next: Alex Albert

What's new from Anthropic and what's next: Alex Albert

Using agents to build an agent company: Joao Moura

Using agents to build an agent company: Joao Moura

Decoding the Decoder LLM without de code: Ishan Anand

Decoding the Decoder LLM without de code: Ishan Anand

Running AI Application in Minutes w/ AI Templates: Gabriela de Queiroz, Pamela Fox, Harald Kirschner

Running AI Application in Minutes w/ AI Templates: Gabriela de Queiroz, Pamela Fox, Harald Kirschner

Building with Anthropic Claude: Prompt Workshop with Zack Witten

Building with Anthropic Claude: Prompt Workshop with Zack Witten

Building Reliable Agentic Systems: Eno Reyes

Building Reliable Agentic Systems: Eno Reyes

10x Development: LLMs For the working Programmer - Manuel Odendahl

10x Development: LLMs For the working Programmer - Manuel Odendahl

Disrupting the $15 Trillion Construction Industry with Autonomous Agents: Dr. Sarah Buchner

Disrupting the $15 Trillion Construction Industry with Autonomous Agents: Dr. Sarah Buchner

Hypermode Launch: Kevin Van Gundy

Hypermode Launch: Kevin Van Gundy

Git push get an AI API: Ryan Fox-Tyler

Git push get an AI API: Ryan Fox-Tyler

More on: AI Security

View skill →

Managing Threat Intelligence with Cortex XSOAR

SECURE Your App with Roles and Permissions in Spring Security!

SECURE Your App with Roles and Permissions in Spring Security!

TheCodeAlchemist

Pete Bryan - Scaling Jupyter Notebooks for global scale security analysis | JupyterCon 2020

Pete Bryan - Scaling Jupyter Notebooks for global scale security analysis | JupyterCon 2020

Warning! Python Remote Keylogger (this is really too easy!)

Warning! Python Remote Keylogger (this is really too easy!)

Episode 9: Automating Single-Turn Attacks with PyRIT | AI Red Teaming 101

Episode 9: Automating Single-Turn Attacks with PyRIT | AI Red Teaming 101

Microsoft Developer

Secure Agent Authorization with OAuth 2.0 | Amazon Bedrock AgentCore | Amazon Web Services

Secure Agent Authorization with OAuth 2.0 | Amazon Bedrock AgentCore | Amazon Web Services

Amazon Web Services

Related AI Lessons

Operational continuity is not governability.

Operational continuity and governability are distinct concepts in AI and business, and understanding their differences is crucial for effective management

Medium · Deep Learning

AI gave North Korean hackers a $600 million month. DeFi is still working out how to respond.

AI-powered North Korean hackers stole $600 million from DeFi platforms in one month, highlighting the need for improved security measures

The Next Web AI

The Fallacy of Vibe-Driven Development: A Critical Look at AI Scaling

Learn to critically evaluate AI scaling strategies and avoid the pitfalls of vibe-driven development to ensure effective AI implementation

Dev.to · Aneesha Prasannan

New Jersey’s 2026 AI Push

New Jersey advances AI legislation to combat deepfakes with harsher penalties, including up to 5 years imprisonment and $30,000 fines

Chapters (11)

Introduction to Haizing

1:16 The "Last Mile Problem" in AI

2:47 The Brittleness of GenAI Applications

3:54 Examples of Brittle Chatbots

4:29 Inadequacy of Standard Evaluation Methods

6:09 Haizing: Simulating the Last Mile

8:43 Scaling Evaluation with Agents as Judges

9:29 Verdict: Accuracy vs. Latency

11:47 Scaling Evaluation with RL-Tuned Judges

14:06 Fuzzing vs. Adversarial Testing in AI

14:37 Simulation as Prompt Opt

Don’t Let AI Make You Dumb 🧠 #shorts

Jacky Chou from Indexsy