Accelerate agent testing with Evals for Agent Interoperability
Key Takeaways
Evals for Agent Interop is an open-source tool that provides a structured and transparent way to evaluate agents, so developers can benchmark them across realistic scenarios with confidence. It offers curated evaluation suites with synthetic data, configurable rubrics, and granular metrics such as tool use accuracy, latency, and groundedness.
Full Transcript
Hi, I'm Darini Jois, principal product manager at Microsoft. We heard from customers that they struggle to evaluate agents effectively. They often don't have a standard way to measure performance.
>> And I'm Adashkan, principal applied scientist. That's why we built Evals for Agent Interop, an open-source tool that brings structure and transparency to agent evaluation so you can benchmark across realistic scenarios with confidence.
>> We empathize with agent developers who often spend weeks building scenario-specific data sets for testing, a process that is both time-consuming and a distraction from actual agent innovation. Evals for Agent Interop solves this by providing curated evaluation suites with synthetic data so you can start testing immediately. Plus, you can configure your own rubrics, whether you care about tone, compliance, or accuracy.
>> Customers have expressed that even when tests exist, results are often opaque. They can't isolate why an agent passed or failed. With Evals for Agent Interop, evaluations are systematic and transparent. You'll see granular metrics like tool use accuracy, latency, and groundedness, so you know exactly where your agent shines and where it needs work.
>> In a world where countless agents claim the same capability, how do you confidently choose the one that truly meets your standards?
>> Our leaderboard aggregates results across runs, letting you sort by what matters most: speed, compliance, or accuracy, so decisions move from guesswork to data-driven.
>> This is just the beginning. Today, we offer a small set of curated scenarios. Soon, evals will be standardized and scalable across a multitude of scenarios.
>> Now, let's see Evals for Agent Interop in action. Over to Alistair for a live demo.
>> Thanks, Adar. Hi, my name is Alistair. I'm a principal architect at Microsoft.
In this demo, I'll walk through how Evals for Agent Interop can help you evaluate, review, and compare agents by benchmarking them against real-world scenarios, making your agent development process more efficient.
Now, let's jump into Evals for Agent Interop and start by exploring the evaluation data sets. Each data set is curated to test a specific scenario. For example, in this list, we have data sets for testing email collaboration and meeting scheduling. Each data set consists of a set of tests and assertions. Let's drill into the meeting scheduling data set. Within the data set, you can see that we have seven tests. Each test case has the input that will be passed to your agent, the tools that are expected to be called, and details about what is expected from each tool call. These details give you insight into exactly what is being measured, and will be used later to evaluate the agent's responses.
Now that we've explored the data sets, let's get our agent ready for testing. First, we will navigate to the agents page, where the list of registered agents is displayed. Here, we can add a new agent. In our case, I have one running locally that can help with meeting scheduling. Let's register it by clicking register and providing the name, the endpoint where the agent can be called, the model, and a description.
Now that we've registered the agent, we can start running an evaluation by clicking run evals. This presents a list of available data sets. Let's pick the meeting scheduling data set. Now the evaluation is running against our agent. In the background, the agent evaluator is invoking the agent once for each test case in the data set, passing in appropriate inputs and logging the result. The evaluator is also providing mock tool responses to the agent to ensure that each test run is consistent. To support this, we've modified our sample agent to use agent eval's built-in MCP server. As the results come in, the view will update in real time.
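The test-case structure and the evaluation loop described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: `EvalCase`, `ExpectedToolCall`, `run_test`, the tool names `find_free_slots` and `create_meeting`, and the sample data are all hypothetical. It only shows the idea of passing an input to an agent, answering its tool calls with fixed mock responses, and checking the recorded calls against expectations.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class ExpectedToolCall:
    tool: str        # tool the agent is expected to call
    arguments: dict  # argument values that call must include


@dataclass
class EvalCase:
    description: str
    agent_input: str
    expected_calls: list[ExpectedToolCall]


def run_test(agent, case: EvalCase, mock_responses: dict) -> dict:
    """Invoke the agent once, serving mock tool responses and recording calls."""
    calls = []

    def call_tool(tool: str, arguments: dict) -> dict:
        calls.append((tool, arguments))
        # Fixed responses keep every run of the same test consistent.
        return mock_responses.get(tool, {})

    response = agent(case.agent_input, call_tool)

    failures = []
    for expected in case.expected_calls:
        matched = any(
            tool == expected.tool
            and all(args.get(k) == v for k, v in expected.arguments.items())
            for tool, args in calls
        )
        if not matched:
            failures.append(f"missing call to {expected.tool} with {expected.arguments}")

    return {
        "description": case.description,
        "passed": not failures,
        "failures": failures,
        "response": response,
    }


# Hypothetical agent: looks up a free slot, then books the first one.
def toy_agent(text, call_tool):
    slots = call_tool("find_free_slots", {"attendee": "alex@example.com"})
    call_tool("create_meeting", {"start": slots["slots"][0], "attendee": "alex@example.com"})
    return "Meeting booked."


case = EvalCase(
    description="Schedule a meeting with one attendee",
    agent_input="Book 30 minutes with Alex next week.",
    expected_calls=[
        ExpectedToolCall("find_free_slots", {"attendee": "alex@example.com"}),
        ExpectedToolCall("create_meeting", {"start": "2025-01-06T10:00"}),
    ],
)
mocks = {
    "find_free_slots": {"slots": ["2025-01-06T10:00"]},
    "create_meeting": {"id": "m1"},
}

result = run_test(toy_agent, case, mocks)
print(result["passed"], result["failures"])
```

The sketch checks tool calls with exact argument matching. The demo also mentions evaluator reasoning for each assertion, which suggests a richer judgment step that this example does not attempt to reproduce.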
Depending on the number of tests, this could take several minutes to complete.
After running the evaluation, it's time to review the results. For every test, you'll get a full breakdown, starting with the test description and whether it passed or failed. You can then drill into a specific test case to see the agent's actual response, the expected outcome, and why the agent passed or failed the test. In the details view, you can see which tools were called by your agent, the expectation for each tool call, and the evaluator's reasoning for deciding why an assertion passed or failed. This feedback loop is crucial for iterating on your agent and quickly identifying areas for improvement.
Now, let's talk about the leaderboard. As you test your agents, you can use the leaderboard to see how they stack up and whether they are improving over time. Each evaluation is aggregated and ranked so that you can quickly see how each of your agents is performing against the evaluation data sets. Each row in the leaderboard shows the agent's AI model and an overall score summarizing its pass rate across the selected data sets. You can filter data sets to compare agents for specific tasks. If you want the details, clicking on an agent takes you back to the scenario-level results, giving you deeper insights into its strengths and weaknesses.
That's a quick overview of Evals for Agent Interop. It's designed to bring clarity and actionable insights to agent development. If you're ready to test your agent or if you want to contribute, try out Evals for Agent Interop by visiting our open-source repository on GitHub. Here you can find the application code, the evaluation data sets we are open sourcing, and the details needed to get up and running. We're excited to see what you can build and get your feedback.
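The leaderboard aggregation described above can be illustrated with a short Python sketch. It assumes the overall score is a plain fraction of passed tests over the selected data sets, which the transcript suggests but does not spell out. The function name `leaderboard`, the tuple layout, and the agent, model, and data set names are hypothetical.

```python
from collections import defaultdict


def leaderboard(results, datasets=None):
    """Rank agents by pass rate.

    results:  iterable of (agent, model, dataset, passed) tuples
    datasets: optional set of data set names to include (None means all)
    """
    totals = defaultdict(lambda: [0, 0])  # (agent, model) -> [passed, total]
    for agent, model, dataset, passed in results:
        if datasets is not None and dataset not in datasets:
            continue
        entry = totals[(agent, model)]
        entry[0] += int(passed)
        entry[1] += 1

    rows = [
        {"agent": agent, "model": model, "pass_rate": passed / total}
        for (agent, model), (passed, total) in totals.items()
    ]
    # Highest pass rate first.
    return sorted(rows, key=lambda row: row["pass_rate"], reverse=True)


# Hypothetical per-test outcomes collected from earlier evaluation runs.
results = [
    ("scheduler-v1", "model-a", "meeting_scheduling", True),
    ("scheduler-v1", "model-a", "meeting_scheduling", False),
    ("scheduler-v2", "model-b", "meeting_scheduling", True),
    ("scheduler-v2", "model-b", "meeting_scheduling", True),
    ("scheduler-v2", "model-b", "email_collaboration", False),
]

board = leaderboard(results, datasets={"meeting_scheduling"})
for row in board:
    print(row["agent"], row["model"], row["pass_rate"])
```

Filtering by data set before aggregating mirrors the demo's option to compare agents on specific tasks: an agent that ranks first on meeting scheduling may rank differently once other data sets are included.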
Original Description
Introducing Evals for Agent Interop, the way to evaluate those cross-stack connections end to end in realistic scenarios. Evals for Agent Interop provides curated scenarios and representative data that emulate real digital work, along with an evaluation harness that organizations can use to self-run their agents across Microsoft 365 surfaces (Email, Documents, Teams, Calendar, and more). It’s designed to be simple to start, yet capable enough to reveal quality, efficiency, robustness, and user experience tradeoffs between agent implementations, so organizations can make informed choices quickly.
https://aka.ms/EvalsForAgentInterop
https://learn.microsoft.com/en-us/microsoft-agent-365/
https://learn.microsoft.com/en-us/microsoft-agent-365/tooling-servers-overview?utm_source=chatgpt.com
https://devblogs.microsoft.com/microsoft365dev/from-innovation-to-enterprise-trust-with-microsoft-agent-365/
https://devblogs.microsoft.com/microsoft365dev/microsoft-agent-365-interoperability-for-smart-secure-productivity/
https://www.microsoft.com/en-us/microsoft-agent-365?msockid=3535fcba82d669720766ed1c8358686d