Keynote: The Unbearable Lightness of (Agentic) Evaluations - Besmira Nushi

Name: Keynote: The Unbearable Lightness of (Agentic) Evaluations - Besmira Nushi
Uploaded: 2026-04-20T20:22:20Z
Channel: PyTorch

PyTorch · Intermediate · 🧠 Large Language Models · 3w ago

Skills: LLM Foundations 80%, AI Alignment Basics 60%

Keynote: The Unbearable Lightness of (Agentic) Evaluations - Besmira Nushi, Senior Manager, AI Research, NVIDIA

The discipline of evaluating large language models underwent a major transformation with the rise of general AI capabilities. Today, the field is undergoing yet another challenging transformation following the groundbreaking improvements in agentic tasks, which expect models and systems to plan and take autonomous actions in the real world. Measuring how well models and systems perform in such tasks is, however, still (i) fragile from a methodological perspective and (ii) difficult to scale and generalize across different domains. This talk will first discuss common challenges in reproducing agentic evaluations, including differences in reference implementations, error handling, trajectory post-processing, and tooling definitions. Next, it will cover infrastructural requirements that need to be addressed for such evaluations to run efficiently at scale. Finally, it will conclude with a set of (still nascent) best practices that can help alleviate “lightness” and build more consistent measurement pipelines.
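
To make the error-handling point concrete, here is a minimal sketch of how one policy choice can change a reported score. It is not taken from the talk: the `Episode` record, the `success_rate` helper, and the five toy episodes are all hypothetical, and the only assumption is that some agent runs crash or time out before they can be graded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """One agent run on one task (hypothetical record, for illustration only)."""
    task_id: str
    success: Optional[bool]  # None means the run crashed or timed out before grading


def success_rate(episodes: List[Episode], errors_count_as_failures: bool) -> float:
    """Fraction of successful episodes under one of two error-handling policies."""
    if errors_count_as_failures:
        # Strict policy: every episode stays in the denominator.
        graded = episodes
    else:
        # Lenient policy: ungraded (errored) episodes are dropped before scoring.
        graded = [e for e in episodes if e.success is not None]
    if not graded:
        return 0.0
    return sum(1 for e in graded if e.success) / len(graded)


# Toy data, invented for this sketch: two successes, one failure, two errored runs.
episodes = [
    Episode("t1", True),
    Episode("t2", False),
    Episode("t3", None),
    Episode("t4", True),
    Episode("t5", None),
]

strict = success_rate(episodes, errors_count_as_failures=True)
lenient = success_rate(episodes, errors_count_as_failures=False)
print(f"strict={strict:.2f} lenient={lenient:.2f}")  # strict=0.40 lenient=0.67
```

The same five trajectories yield 0.40 or 0.67 depending only on how errored runs are treated, which is one reason a reported number is hard to reproduce unless the policy is stated alongside it.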
