Accelerate agent testing with Evals for Agent Interoperability

Microsoft 365 Developer · Beginner · 🧠 Large Language Models · 5mo ago

Skills: Agent Foundations 80% · Tool Use & Function Calling 70%

Key Takeaways

Evals for Agent Interop is an open-source tool that provides a structured and transparent way to evaluate agents, allowing developers to benchmark their agents across realistic scenarios with confidence. The tool offers curated evaluation suites with synthetic data, configurable rubrics, and granular metrics for systematic and transparent evaluations.

Full Transcript

Hi, I'm Darini Jois, principal product manager at Microsoft. We heard from customers that they struggle to evaluate agents effectively. They often don't have a standard way to measure performance.

>> And I'm Adashkan, principal applied scientist. That's why we built Evals for Agent Interop, an open-source tool that brings structure and transparency to agent evaluation so you can benchmark across realistic scenarios with confidence.

>> We empathize with agent developers who often spend weeks building scenario-specific data sets for testing, a process that is both time-consuming and a distraction from actual agent innovation. Evals for Agent Interop solves this by providing curated evaluation suites with synthetic data so you can start testing immediately. Plus, you can configure your own rubrics, whether you care about tone, compliance, or accuracy.

>> Customers have expressed that even when tests exist, results are often opaque. They can't isolate why an agent passed or failed. With Evals for Agent Interop, evaluations are systematic and transparent. You'll see granular metrics like tool use accuracy, latency, and groundedness, so you know exactly where your agent shines and where it needs work.

>> In a world where countless agents claim the same capability, how do you confidently choose the one that truly meets your standards?

>> Our leaderboard aggregates results across runs, letting you sort by what matters most: speed, compliance, or accuracy. Decisions move from guesswork to data-driven.

>> This is just the beginning. Today, we offer a small set of curated scenarios. Soon, evals will be standardized and scalable across a multitude of scenarios.

>> Now, let's see Evals for Agent Interop in action. Over to Alistair for a live demo.

>> Thanks, Adar. Hi, my name is Alistair. I'm a principal architect at Microsoft.
In this demo, I'll walk through how Evals for Agent Interop can help you evaluate, review, and compare agents by benchmarking them against real-world scenarios, making your agent development process more efficient.

Now, let's jump into Evals for Agent Interop and start by exploring the evaluation data sets. Each data set is curated to test a specific scenario. For example, in this list, we have a data set for testing email collaboration and meeting scheduling. Data sets consist of a set of tests and assertions. Let's drill into the meeting scheduling data set. Within the data set, you can see that we have seven tests. Each test case has the input that will be passed to your agent, the tools that are expected to be called, and details about what is expected from each tool call. These details give you insight into exactly what is being measured, and will be used later to evaluate the agent's responses.

Now that we've explored the data sets, let's get our agent ready for testing. First, we will navigate to the agents page, where the list of registered agents is displayed. Here, we can add a new agent. In our case, I have one running locally that can help with meeting scheduling. Let's register it by clicking register and providing the name, the endpoint where the agent can be called, the model, and a description.

Now that we've registered the agent, we can start running an evaluation by clicking run evals. This presents a list of available data sets. Let's pick the meeting scheduling data set. Now the evaluation is running against our agent. In the background, agent evaluator is invoking the agent once for each test case in the data set, passing in appropriate inputs and logging the result. The evaluator is also providing mock tool responses to the agent to ensure that each test run is consistent. To support this, we've modified our sample agent to use agent eval's built-in MCP server. As the results come in, the view will update in real time.
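The test-case structure and mock tool responses described in the demo can be sketched roughly as follows. This is only an illustration, not the tool's actual schema or API: every name here (`TestCase`, `ExpectedToolCall`, `MOCK_TOOLS`, `run_test`, the tool names, and the sample agent) is hypothetical, the agent is a plain Python function rather than a registered endpoint, and assertions are checked by exact argument matching rather than by the evaluator shown in the video.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ExpectedToolCall:
    tool: str                  # tool the agent is expected to call
    arguments: dict[str, Any]  # arguments that must appear in that call


@dataclass
class TestCase:
    name: str
    agent_input: str           # input passed to the agent
    expected_calls: list[ExpectedToolCall]


# Hypothetical mock tools: fixed responses keep every run consistent.
MOCK_TOOLS: dict[str, Callable[[dict[str, Any]], dict[str, Any]]] = {
    "find_free_slots": lambda args: {"slots": ["2025-01-15T10:00"]},
    "create_meeting": lambda args: {"status": "created"},
}


def sample_agent(agent_input: str, call_tool: Callable) -> str:
    """Toy scheduling agent (an assumption for this sketch)."""
    slots = call_tool("find_free_slots", {"attendee": "alex@example.com"})["slots"]
    call_tool("create_meeting", {"attendee": "alex@example.com", "start": slots[0]})
    return f"Meeting booked for {slots[0]}"


def run_test(case: TestCase, agent: Callable) -> dict[str, Any]:
    """Invoke the agent once, record its tool calls, and check each assertion."""
    calls: list[tuple[str, dict[str, Any]]] = []

    def call_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
        calls.append((name, args))     # log the call for later inspection
        return MOCK_TOOLS[name](args)  # answer with the mock response

    response = agent(case.agent_input, call_tool)

    assertions = []
    for expected in case.expected_calls:
        # An expectation is met if some recorded call used the same tool
        # and contained every expected argument with the expected value.
        matched = any(
            name == expected.tool
            and all(args.get(k) == v for k, v in expected.arguments.items())
            for name, args in calls
        )
        assertions.append((expected.tool, matched))

    return {
        "name": case.name,
        "response": response,
        "assertions": assertions,
        "passed": all(ok for _, ok in assertions),
    }
```

Keeping the per-assertion results alongside the overall `passed` flag mirrors the idea in the demo that you can see why a test passed or failed, not only that it did.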
Depending on the number of tests, this could take several minutes to complete.

After running the evaluation, it's time to review the results. For every test, you'll get a full breakdown, starting with the test description and whether it passed or failed. You can then drill into a specific test case to see the agent's actual response, the expected outcome, and why the agent passed or failed the test. In the details view, you can see which tools were called by your agent, the expectation for each tool call, and the evaluator's reasoning for deciding why an assertion passed or failed. This feedback loop is crucial for iterating on your agent and quickly identifying areas for improvement.

Now, let's talk about the leaderboard. As you test your agents, you can use the leaderboard to see how they stack up and whether they are improving over time. Each evaluation is aggregated and ranked so that you can quickly see how each of your agents is performing against the evaluation data sets. Each row in the leaderboard shows the agent's AI model and an overall score summarizing its pass rate across the selected data sets. You can filter data sets to compare agents for specific tasks. If you want the details, clicking on an agent takes you back to the scenario-level results, giving you deeper insights into its strengths and weaknesses.

That's a quick overview of Evals for Agent Interop. It's designed to bring clarity and actionable insights to agent development. If you're ready to test your agent or if you want to contribute, try out Evals for Agent Interop by visiting our open-source repository on GitHub. There you can find the application code, the evaluation data sets we are open sourcing, and the details needed to get up and running. We're excited to see what you build and to get your feedback.
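The leaderboard idea described above, an overall score summarizing each agent's pass rate across the selected data sets, can be sketched as a small aggregation. This is a minimal illustration under assumptions: `TestResult` and `leaderboard` are hypothetical names, and the real tool may compute or weight its scores differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TestResult:
    agent: str    # registered agent name
    model: str    # AI model behind the agent
    dataset: str  # evaluation data set the test belongs to
    passed: bool


def leaderboard(
    results: Iterable[TestResult],
    datasets: Optional[set[str]] = None,
) -> list[dict]:
    """Rank agents by pass rate, optionally restricted to some data sets."""
    totals: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for r in results:
        if datasets is not None and r.dataset not in datasets:
            continue  # data set filter, as in the leaderboard view
        entry = totals[(r.agent, r.model)]
        entry[0] += int(r.passed)  # passed count
        entry[1] += 1              # total count

    rows = [
        {"agent": agent, "model": model, "pass_rate": passed / total}
        for (agent, model), (passed, total) in totals.items()
    ]
    # Highest pass rate first.
    return sorted(rows, key=lambda row: row["pass_rate"], reverse=True)
```

Passing a `datasets` filter narrows the comparison to specific tasks, for example only a meeting scheduling data set, while omitting it ranks agents across everything that was run.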

Original Description

Introducing Evals for Agent Interop, the way to evaluate those cross-stack connections end to end in realistic scenarios. Evals for Agent Interop provides curated scenarios and representative data that emulate real digital work, along with an evaluation harness that organizations can use to self-run their agents across Microsoft 365 surfaces (Email, Documents, Teams, Calendar, and more). It’s designed to be simple to start, yet capable enough to reveal quality, efficiency, robustness, and user experience tradeoffs between agent implementations, so organizations can make informed choices quickly.

https://aka.ms/EvalsForAgentInterop
https://learn.microsoft.com/en-us/microsoft-agent-365/
https://learn.microsoft.com/en-us/microsoft-agent-365/tooling-servers-overview?utm_source=chatgpt.com
https://devblogs.microsoft.com/microsoft365dev/from-innovation-to-enterprise-trust-with-microsoft-agent-365/
https://devblogs.microsoft.com/microsoft365dev/microsoft-agent-365-interoperability-for-smart-secure-productivity/
https://www.microsoft.com/en-us/microsoft-agent-365?msockid=3535fcba82d669720766ed1c8358686d



By using Evals for Agent Interop, developers can quickly identify areas for improvement and iterate on their agents to achieve better performance.

Key Takeaways

Explore the evaluation data sets
Register an agent for testing
Run an evaluation against the agent
Review the evaluation results
Use the leaderboard to compare agent performance

💡 Evals for Agent Interop provides a systematic and transparent way to evaluate agents, allowing developers to benchmark their agents across realistic scenarios with confidence.
