Accelerate agent testing with Evals for Agent Interoperability

Microsoft 365 Developer · Beginner ·🧠 Large Language Models ·5mo ago

Key Takeaways

Evals for Agent Interop is an open-source tool that provides a structured and transparent way to evaluate agents, allowing developers to benchmark their agents across realistic scenarios with confidence. The tool offers curated evaluation suites with synthetic data, configurable rubrics, and granular metrics for systematic and transparent evaluations.

Full Transcript

Hi, I'm Darini Jois, principal product manager at Microsoft. We heard from customers that they struggle to evaluate agents effectively. They often don't have a standard way to measure performance.
>> And I'm Adashkan, principal applied scientist. That's why we built Evals for Agent Interop, an open-source tool that brings structure and transparency to agent evaluation so you can benchmark across realistic scenarios with confidence.
>> We empathize with agent developers who often spend weeks building scenario-specific data sets for testing, a process that is both time-consuming and a distraction from actual agent innovation. Evals for Agent Interop solves this by providing curated evaluation suites with synthetic data so you can start testing immediately. Plus, you can configure your own rubrics, whether you care about tone, compliance, or accuracy.
>> Customers have expressed that even when tests exist, results are often opaque. They can't isolate why an agent passed or failed. With Evals for Agent Interop, evaluations are systematic and transparent. You'll see granular metrics like tool use accuracy, latency, and groundedness, so you know exactly where your agent shines and where it needs work.
>> In a world where countless agents claim the same capability, how do you confidently choose the one that truly meets your standards?
>> Our leaderboard aggregates results across runs, letting you sort by what matters most: speed, compliance, accuracy. Decisions move from guesswork to data-driven.
>> This is just the beginning. Today, we offer a small set of curated scenarios. Soon, evals will be standardized and scalable across a multitude of scenarios.
>> Now, let's see Evals for Agent Interop in action. Over to Alistair for a live demo.
>> Thanks, Adar. Hi, my name is Alistair. I'm a principal architect at Microsoft.
In this demo, I'll walk through how Evals for Agent Interop can help you evaluate, review, and compare agents by benchmarking them against real-world scenarios, making your agent development process more efficient.
Now, let's jump into Evals for Agent Interop and start by exploring the evaluation data sets. Each data set is curated to test a specific scenario. For example, in this list, we have a data set for testing email collaboration and one for meeting scheduling. Data sets consist of a set of tests and assertions. Let's drill into the meeting scheduling data set. Within the data set, you can see that we have seven tests. Each test case has the input that will be passed to your agent, the tools that are expected to be called, and details about what is expected from each tool call. These details give you insight into exactly what is being measured, and will be used later to evaluate the agent's responses.
Now that we've explored the data sets, let's get our agent ready for testing. First, we will navigate to the agents page, where the list of registered agents is displayed. Here, we can add a new agent. In our case, I have one running locally that can help with meeting scheduling. Let's register it by clicking register and providing the name, the endpoint where the agent can be called, the model, and a description.
Now that we've registered, we can start running an evaluation by clicking run evals. This presents a list of available data sets. Let's pick the meeting scheduling data set. Now the evaluation is running against our agent. In the background, agent evaluator is invoking the agent once for each test case in the data set, passing in appropriate inputs and logging the result. The evaluator is also providing mock tool responses to the agent to ensure that each test run is consistent. To support this, we've modified our sample agent to use agent eval's built-in MCP server. As the results come in, the view will update in real time.
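The test-case structure and evaluation loop described above can be sketched roughly as follows. This is an illustrative sketch only, not the tool's actual code: the names `EvalCase`, `ExpectedToolCall`, `run_test`, and the tool names `find_free_slots` and `create_meeting` are all hypothetical, and the exact-match check here is a simplified stand-in for the evaluator shown in the demo.

```python
from dataclasses import dataclass, field
from typing import Callable

# All names below are hypothetical, chosen only to illustrate the idea.

@dataclass
class ExpectedToolCall:
    tool: str            # tool the agent is expected to call
    expected_args: dict  # arguments the call is expected to include

@dataclass
class EvalCase:
    description: str
    agent_input: str                       # input passed to the agent
    expected_calls: list[ExpectedToolCall]
    mock_responses: dict[str, dict] = field(default_factory=dict)

# An agent is modeled as a callable that receives the input text and a
# function it must use to call tools; it returns its final response.
Agent = Callable[[str, Callable[[str, dict], dict]], str]

def run_test(agent: Agent, case: EvalCase) -> dict:
    """Invoke the agent once, serving mock tool responses and recording calls."""
    recorded: list[tuple[str, dict]] = []

    def mock_tool(name: str, args: dict) -> dict:
        recorded.append((name, args))
        # The same canned response every run keeps results consistent.
        return case.mock_responses.get(name, {})

    response = agent(case.agent_input, mock_tool)

    failures = []
    for exp in case.expected_calls:
        calls = [args for name, args in recorded if name == exp.tool]
        if not calls:
            failures.append(f"tool '{exp.tool}' was not called")
        elif not any(
            all(args.get(k) == v for k, v in exp.expected_args.items())
            for args in calls
        ):
            failures.append(f"tool '{exp.tool}' called without expected arguments")

    return {
        "description": case.description,
        "passed": not failures,
        "failures": failures,
        "response": response,
    }

# A made-up meeting scheduling case and a toy agent that satisfies it.
case = EvalCase(
    description="Schedule a meeting in the first free slot",
    agent_input="Book 30 minutes with alex@example.com",
    expected_calls=[
        ExpectedToolCall("find_free_slots", {"attendee": "alex@example.com"}),
        ExpectedToolCall("create_meeting", {"start": "2025-01-06T10:00"}),
    ],
    mock_responses={
        "find_free_slots": {"slots": ["2025-01-06T10:00"]},
        "create_meeting": {"status": "created"},
    },
)

def sample_agent(text: str, call_tool: Callable[[str, dict], dict]) -> str:
    slots = call_tool("find_free_slots", {"attendee": "alex@example.com"})
    start = slots["slots"][0]
    call_tool("create_meeting", {"attendee": "alex@example.com", "start": start})
    return f"Booked {start}"

result = run_test(sample_agent, case)
print(result["passed"], result["failures"])
```

Because the tool responses are canned per test case, two runs of the same agent see identical tool output, which is the consistency property the demo attributes to the mock tool responses.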
And depending on the number of tests, this could take several minutes to complete.
After running the evaluation, it's time to review the results. For every test, you'll get a full breakdown, starting with the test description and whether it passed or failed. You can then drill into a specific test case to see the agent's actual response, the expected outcome, and why the agent passed or failed the test. In the details view, you can see which tools were called by your agent, the expectation for each tool call, and the evaluator's reasoning for deciding why an assertion passed or failed. This feedback loop is crucial for iterating on your agent and quickly identifying areas for improvement.
Now, let's talk about the leaderboard. As you test your agents, you can use the leaderboard to see how they stack up and whether they are improving over time. Each evaluation is aggregated and ranked so that you can quickly see how each of your agents is performing against the evaluation data sets. Each row in the leaderboard shows the agent's AI model and an overall score summarizing its pass rate across the selected data sets. You can filter data sets to compare agents for specific tasks. If you want the details, clicking on an agent takes you back to the scenario-level results, giving you deeper insights into its strengths and weaknesses.
That's a quick overview of Evals for Agent Interop. It's designed to bring clarity and actionable insights to agent development. If you're ready to test your agent or if you want to contribute, try out Evals for Agent Interop by visiting our open-source repository on GitHub. There you can find the application code, the evaluation data sets we are open sourcing, and the details needed to get up and running. We're excited to see what you build and to get your feedback.
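As a rough illustration of the leaderboard scoring described above (an overall pass rate per agent across the selected data sets, ranked highest first), here is a minimal sketch. The record shape `CaseResult`, the function `leaderboard`, and the agent, model, and data set names are assumptions made for this example, not the tool's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical record of one test outcome; field names are assumptions.
@dataclass(frozen=True)
class CaseResult:
    agent: str
    model: str
    dataset: str
    passed: bool

def leaderboard(
    results: Iterable[CaseResult],
    datasets: Optional[set] = None,
) -> list:
    """Rank agents by pass rate, optionally restricted to selected data sets."""
    totals = defaultdict(lambda: [0, 0])  # (agent, model) -> [passed, total]
    for r in results:
        if datasets is not None and r.dataset not in datasets:
            continue
        entry = totals[(r.agent, r.model)]
        entry[0] += int(r.passed)
        entry[1] += 1
    rows = [
        {"agent": agent, "model": model, "pass_rate": passed / total}
        for (agent, model), (passed, total) in totals.items()
    ]
    return sorted(rows, key=lambda row: row["pass_rate"], reverse=True)

# Made-up results for two versions of an agent.
results = [
    CaseResult("scheduler-v1", "model-a", "meeting_scheduling", True),
    CaseResult("scheduler-v1", "model-a", "meeting_scheduling", False),
    CaseResult("scheduler-v1", "model-a", "email_collaboration", True),
    CaseResult("scheduler-v2", "model-b", "meeting_scheduling", True),
    CaseResult("scheduler-v2", "model-b", "meeting_scheduling", True),
    CaseResult("scheduler-v2", "model-b", "email_collaboration", True),
]

overall = leaderboard(results)
scheduling_only = leaderboard(results, datasets={"meeting_scheduling"})
for row in overall:
    print(row["agent"], row["model"], round(row["pass_rate"], 2))
```

Passing a `datasets` filter mirrors the idea of comparing agents on a specific task, while omitting it gives the overall score across every data set that was run.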

Original Description

Introducing Evals for Agent Interop, the way to evaluate those cross-stack connections end to end in realistic scenarios. Evals for Agent Interop provides curated scenarios and representative data that emulate real digital work, along with an evaluation harness that organizations can use to self-run their agents across Microsoft 365 surfaces (Email, Documents, Teams, Calendar, and more). It’s designed to be simple to start, yet capable enough to reveal quality, efficiency, robustness, and user experience tradeoffs between agent implementations, so organizations can make informed choices quickly.
https://aka.ms/EvalsForAgentInterop
https://learn.microsoft.com/en-us/microsoft-agent-365/
https://learn.microsoft.com/en-us/microsoft-agent-365/tooling-servers-overview?utm_source=chatgpt.com
https://devblogs.microsoft.com/microsoft365dev/from-innovation-to-enterprise-trust-with-microsoft-agent-365/
https://devblogs.microsoft.com/microsoft365dev/microsoft-agent-365-interoperability-for-smart-secure-productivity/
https://www.microsoft.com/en-us/microsoft-agent-365?msockid=3535fcba82d669720766ed1c8358686d

By using Evals for Agent Interop, developers can quickly identify areas for improvement and iterate on their agents to achieve better performance.

Key Takeaways
  1. Explore the evaluation data sets
  2. Register an agent for testing
  3. Run an evaluation against the agent
  4. Review the evaluation results
  5. Use the leaderboard to compare agent performance
