Evaluating AI Agents: Why It Matters and How We Do It

MLOps.community · Advanced ·🤖 AI Agents & Automation ·7mo ago

Skills: Agent Foundations80%Tool Use & Function Calling70%RAG Evaluation60%

Abstract // As we integrate agentic AI into business products, robust evaluation of the agents is essential to delivering the highest quality. Proper evaluation ensures that AI agents are reliable, safe, effective, and aligned with user intent. Unlike traditional software or machine learning models, AI agents are non-deterministic and require specific types of evaluation. This talk outlines the importance of evaluating AI agents, the key components that we version and test at Acre Security, the metrics that matter for different types of agents, and how we currently achieve success evaluating AI agents that we build at Acre. Bio // Annie Condon// Annie Condon is an AI Solutions Engineer at Acre Security, where she helps bring intelligent systems to the physical access control space—without letting any rogue AI lock people out of a building (on purpose, anyway). With over 8 years of experience across machine learning, data science, and AI, Annie’s journey has taken her from deploying traditional ML models to building cutting-edge AI agents. Jeff Groom// Jeff Groom is an accomplished engineering leader specializing in AI-powered products. As the current Director of Engineering for AI-focused initiatives at Acre Security, he spearheads the development of domain-optimized solutions designed to enhance security across critical infrastructure sectors. With expertise that bridges advanced machine learning techniques and practical engineering execution, Jeff ensures that AI innovation directly aligns with operational and regulatory needs. Prior to his company being acquired by Acre Security, Jeff led engineering teams in the security space, driving architecture, development, deployment, and continuous improvement of AI systems tailored to real-world threat landscapes. Based in Denver, he is active in the AI and security technology community. An MLOps Community Production sponsored by Databricks

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from MLOps.community · MLOps.community · 0 of 60

← Previous Next →

Our 1st MLOps Meetup // Luke Marsden // MLOps Meetup #1

Our 1st MLOps Meetup // Luke Marsden // MLOps Meetup #1

MLOps.community

Remote Collaboration as a Data Scientist

Remote Collaboration as a Data Scientist

MLOps.community

MLOps Manifesto with Luke Marsden from Dotscience

MLOps Manifesto with Luke Marsden from Dotscience

MLOps.community

MLOps lifecycle description

MLOps lifecycle description

MLOps.community

What Does Best in Class AI/ML Governance Look Like in Fin Services? // Charles Radclyffe // MLOps #2

What Does Best in Class AI/ML Governance Look Like in Fin Services? // Charles Radclyffe // MLOps #2

MLOps.community

Life purpose and too many spreadsheets

Life purpose and too many spreadsheets

MLOps.community

Explainability, Black boxes and EU white paper on reproducibility

Explainability, Black boxes and EU white paper on reproducibility

MLOps.community

Hierarchy of Machine Learning Needs // Phil Winder // MLOps Meetup #3

Hierarchy of Machine Learning Needs // Phil Winder // MLOps Meetup #3

MLOps.community

Automatically Retrain Machine Learning Models? Are best practices worth it?

Automatically Retrain Machine Learning Models? Are best practices worth it?

MLOps.community

Building an MLOps Team? Key ideas to keep in mind

Building an MLOps Team? Key ideas to keep in mind

MLOps.community

Hierarchy of MLOps Needs

Hierarchy of MLOps Needs

MLOps.community

Bare necessities for getting an ML model into production

Bare necessities for getting an ML model into production

MLOps.community

MLOps and Monitoring

MLOps and Monitoring

MLOps.community

How Phil Winder got into Data Science and Software Engineering

How Phil Winder got into Data Science and Software Engineering

MLOps.community

Provenance and Reproducibility in Machine Learning; what is it and why you need it?

Provenance and Reproducibility in Machine Learning; what is it and why you need it?

MLOps.community

Friction Between Data Scientists and Software Engineers

Friction Between Data Scientists and Software Engineers

MLOps.community

MLOps Problems in different size companies

MLOps Problems in different size companies

MLOps.community

ML tooling in large companies

ML tooling in large companies

MLOps.community

ML Platforms - The build vs buy question

ML Platforms - The build vs buy question

MLOps.community

ML Services Gateway at SurveyMonkey

ML Services Gateway at SurveyMonkey

MLOps.community

Message buses, Async and sync architecture

Message buses, Async and sync architecture

MLOps.community

MLOps #4: Shubhi Jain - Building an ML Platform @SurveyMonkey

MLOps #4: Shubhi Jain - Building an ML Platform @SurveyMonkey

MLOps.community

Hybrid Data Science Teams @SurveyMonkey

Hybrid Data Science Teams @SurveyMonkey

MLOps.community

How do you handle ML version control at SurveyMonkey

How do you handle ML version control at SurveyMonkey

MLOps.community

Doing ML with Personal Information

Doing ML with Personal Information

MLOps.community

Evolution of the ML feature store @SurveyMonkey

Evolution of the ML feature store @SurveyMonkey

MLOps.community

Developing a Machine Learning Feature Store

Developing a Machine Learning Feature Store

MLOps.community

Auto retrain ML models is not the question

Auto retrain ML models is not the question

MLOps.community

3 key parts to Machine Learning monitoring

3 key parts to Machine Learning monitoring

MLOps.community

MLOps Meetup #6: Mid-Scale Production Feature Engineering with Dr. Venkata Pingali

MLOps Meetup #6: Mid-Scale Production Feature Engineering with Dr. Venkata Pingali

MLOps.community

MLOps meetup #5 High Stakes ML: Active Failures, Latent Factors with Flavio Clesio

MLOps meetup #5 High Stakes ML: Active Failures, Latent Factors with Flavio Clesio

MLOps.community

MLOps: Airflow Pros and Cons

MLOps: Airflow Pros and Cons

MLOps.community

Specific challenges in Machine Learning

Specific challenges in Machine Learning

MLOps.community

Current State Of Machine Learning

Current State Of Machine Learning

MLOps.community

Humans in the Loop are a defining factor in Machine Learning

Humans in the Loop are a defining factor in Machine Learning

MLOps.community

Learning from real life Machine Learning failures

Learning from real life Machine Learning failures

MLOps.community

Survivorship Bias in machine learning tutorials

Survivorship Bias in machine learning tutorials

MLOps.community

Swiss Cheese model in Machine Learning

Swiss Cheese model in Machine Learning

MLOps.community

Resume driven development in Machine learning & software engineering

Resume driven development in Machine learning & software engineering

MLOps.community

Who has the highest standards in ML?

Who has the highest standards in ML?

MLOps.community

Venkata Pingali of Scribble Data Thoughts on the Current State of Machine Learning

Venkata Pingali of Scribble Data Thoughts on the Current State of Machine Learning

MLOps.community

Dependable data and being able to Trust in your Data with Venkata Pengali of Scribble Data

Dependable data and being able to Trust in your Data with Venkata Pengali of Scribble Data

MLOps.community

Speed, Trust, Evolution and Scale in MLOps

Speed, Trust, Evolution and Scale in MLOps

MLOps.community

More difficult transition for data scientists to become ML engineers

More difficult transition for data scientists to become ML engineers

MLOps.community

How many models in prod til I need a dedicated ML platform?

How many models in prod til I need a dedicated ML platform?

MLOps.community

Deeper thinking from data scientists around platform blackholes

Deeper thinking from data scientists around platform blackholes

MLOps.community

Checkpointing, metadata, and confidence in your data

Checkpointing, metadata, and confidence in your data

MLOps.community

Adjacent usecases and multistep feature engineering

Adjacent usecases and multistep feature engineering

MLOps.community

Standardization of Machine Learning tools like in Software Engineering with Venkata Pingali

Standardization of Machine Learning tools like in Software Engineering with Venkata Pingali

MLOps.community

Reproducability flaws in end to end Machine Learning debugging

Reproducability flaws in end to end Machine Learning debugging

MLOps.community

3rd wave of data scientists

3rd wave of data scientists

MLOps.community

MLOps meetup #7 Alex Spanos // TrueLayer 's MLOps Pipeline

MLOps meetup #7 Alex Spanos // TrueLayer 's MLOps Pipeline

MLOps.community

MLOps Meetup #8 Optimizing Your ML Workflow with Kubeflow 1.0

MLOps Meetup #8 Optimizing Your ML Workflow with Kubeflow 1.0

MLOps.community

Are Kubeflow and Airflow complementary?

Are Kubeflow and Airflow complementary?

MLOps.community

Why Kubeflow gained so much traction=open community

Why Kubeflow gained so much traction=open community

MLOps.community

Who decides the dirrection of Kubeflow

Who decides the dirrection of Kubeflow

MLOps.community

What do Kubeflow and Arrikto do and how do they work together?

What do Kubeflow and Arrikto do and how do they work together?

MLOps.community

Versioning your ML steps with Kubeflow

Versioning your ML steps with Kubeflow

MLOps.community

Machine Learning Lifecycles//Perception vs Reality

Machine Learning Lifecycles//Perception vs Reality

MLOps.community

Kubeflow vs SageMaker in Machine Learning

Kubeflow vs SageMaker in Machine Learning

MLOps.community

More on: Agent Foundations

View skill →

Build and Deploy an Agent with Reasoning Engine in Vertex AI

Adding a Phone Gateway to a Virtual Agent

From Zero to Working AI Agent in 60 Seconds

From Zero to Working AI Agent in 60 Seconds

Create An AI Agent With Replit That Automates Your Sales

Create An AI Agent With Replit That Automates Your Sales

Capstone: Autonomous Runway Detection for IoT

Capstone: Autonomous Runway Detection for IoT

AI Agents with Model Context Protocol & Typescript

AI Agents with Model Context Protocol & Typescript

Related AI Lessons

Comparing 6 AI Routers Is a Mistake — Until You Define ‘Survived’

Evaluating AI routers requires a clear definition of success criteria, as comparing them without context is misleading

Comparing 6 AI Routers Is a Mistake — Until You Define ‘Survived’

Evaluating AI routers requires defining survival metrics, as a simple comparison of 6 AI routers can be misleading

Medium · Programming

What if an AI continued thinking even after you closed the chat?

Explore the concept of AI systems that continue thinking after a conversation ends and its implications

Dev.to · Stell

UI/UX is for humans. DX is for developers. AX is for AI agents – and we just built it.

Learn about the concept of AX (Agent Experience) and its importance in designing APIs for AI agents, and how to build and secure APIs for AI interactions

Dev.to · anhmtk

Combine Skills and MCP to Close the Context Gap — Pedro Rodrigues, Supabase