How to Systematically Setup LLM Evals (Metrics, Unit Tests, LLM-as-a-Judge)

Dave Ebbelaar (LLM Eng) · Intermediate · 🤖 AI Agents & Automation · 8mo ago

Skills: LLM Engineering 90% · Agent Foundations 70% · Tool Use & Function Calling 60%

Want to learn real AI Engineering? Go here: https://go.datalumina.com/iIO93Ps
Want to start freelancing? Let me help: https://go.datalumina.com/vCTpbki
💼 Need help with a project? Work with me: https://go.datalumina.com/TMGbUvO
🔗 Download the free resources: https://go.datalumina.com/QFs1X6H
🛠️ My VS Code / Cursor Setup: https://youtu.be/mpk4Q5feWaw

⏱️ Timestamps
0:00 Introduction to Agentic AI Applications
1:54 Understanding LLM Evaluations
4:54 Core Challenges in LLM Development
7:54 Importance of Iteration and Improvement
9:21 Defining Evaluations in AI Systems
11:04 The Analyze, Measure, Improve Cycle
12:26 Levels of Evaluations
14:01 Unit Tests for LLMs
17:53 Human and Model Evaluations
22:44 Aligning LLM Evaluators
29:02 Process for Building Automated Evaluators
31:21 A/B Testing in AI Applications
34:40 Evaluation Metrics Overview
37:25 Common Mistakes to Avoid
39:46 Key Principles for Success
42:24 Conclusion and Next Steps

📌 Description
In this video, I go over the complete evaluation framework we use at Datalumina to systematically improve AI applications, taking you from basic unit tests all the way through human-aligned model evaluations and A/B testing. I share the exact process that separates the top 5% of AI engineers from those whose projects fail, including tools and code examples you can implement immediately to avoid becoming part of the 95% failure rate.

👋🏻 About Me
Hi! I'm Dave, AI Engineer and founder of Datalumina®. On this channel, I share practical tutorials that teach developers how to build production-ready AI systems that actually work in the real world. Beyond these tutorials, I also help people start successful freelancing careers. Check out the links above to learn more!

Chapters (16)

0:00 Introduction to Agentic AI Applications

1:54 Understanding LLM Evaluations

4:54 Core Challenges in LLM Development

7:54 Importance of Iteration and Improvement

9:21 Defining Evaluations in AI Systems

11:04 The Analyze, Measure, Improve Cycle

12:26 Levels of Evaluations

14:01 Unit Tests for LLMs

17:53 Human and Model Evaluations

22:44 Aligning LLM Evaluators

29:02 Process for Building Automated Evaluators

31:21 A/B Testing in AI Applications

34:40 Evaluation Metrics Overview

37:25 Common Mistakes to Avoid

39:46 Key Principles for Success

42:24 Conclusion and Next Steps
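
The chapters on unit tests and on aligning LLM evaluators can be illustrated with a minimal Python sketch. It is not taken from the video: the function names (check_support_reply, judge_agreement), the JSON shape of the model output, and the allowed category set are all hypothetical, chosen only to show the idea. The first function runs deterministic, unit-test-style checks on a single model output. The second measures how often an automated judge agrees with human labels, which is one simple way to check alignment before trusting the judge.

```python
import json

# Hypothetical label set for an imagined support-ticket classifier.
ALLOWED_CATEGORIES = {"billing", "technical", "other"}


def check_support_reply(raw_output: str) -> list[str]:
    """Run deterministic checks on one model output.

    Returns a list of failure messages; an empty list means every check passed.
    The expected JSON shape ({"category": ..., "reply": ...}) is an assumption.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []

    category = data.get("category")
    if category is None:
        failures.append("missing key: category")
    elif category not in ALLOWED_CATEGORIES:
        failures.append("category outside allowed set")

    reply = data.get("reply")
    if reply is None:
        failures.append("missing key: reply")
    elif not isinstance(reply, str) or not reply.strip():
        failures.append("reply is empty")

    return failures


def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Return the fraction of examples where the judge matches the human label."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    sample = '{"category": "billing", "reply": "Your refund is on its way."}'
    print(check_support_reply(sample))
    print(judge_agreement([True, False, True, True], [True, False, False, True]))
```

Raw agreement is the simplest possible alignment measure. With imbalanced pass/fail labels it can look high even when the judge is uninformative, so a real setup would likely also inspect the disagreements by hand.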
