Keynote: The Unbearable Lightness of (Agentic) Evaluations - Besmira Nushi, Senior Manager, AI Research, NVIDIA
The discipline of evaluating large language models underwent a major transformation with the rise of general AI capabilities. Today, the field is undergoing yet another challenging transformation following the groundbreaking improvements in agentic tasks, which expect models and systems to plan and take autonomous actions in the real world. Measuring how well models and systems perform in such tasks is, however, still i) fragile from a methodological perspective, and ii) difficult to scale and generalize across different domains. This talk will first discuss common challenges in reproducing agentic evaluations, including differences in reference implementations, error handling, trajectory post-processing, and tooling definitions. Next, it will cover infrastructural requirements that need to be addressed for such evaluations to run efficiently at scale. Finally, it will conclude with a set of (still nascent) best practices that can help alleviate “lightness” and build more consistent measurement pipelines.