The Science and Practice of Open and Scalable LLM Evaluations - Grzegorz Chlebus, NVIDIA
Rapid advances in AI have expanded the range of capabilities required for successful real-world deployment. Understanding where models stand on this multi-dimensional frontier is essential for accelerating innovation through effective quality assurance. Rigorous evaluation is increasingly difficult to scale, as development requires testing many checkpoints across numerous benchmarks. Model comparison is further complicated by the limited transparency of reported results. This talk explores challenges, best practices, and open-source tools that elevate evaluation to a core component of LLM development, delivering continuous signals across the model lifecycle.
We discuss principles for standardizing evaluation methods, practical patterns and anti-patterns that improve consistency, and examples of integrating the science of evaluation directly into model development. Using Nemo-Evaluator, an open-source, scalable evaluation tool, we demonstrate modular architectures that enable transparent, reproducible measurement. Finally, we show how Nemo-Evaluator supports reproducible evaluation for the Nemotron model family, helping enable one of the most open development processes in modern AI.
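To make the "modular, reproducible" idea concrete, here is a minimal, hypothetical Python sketch of the general pattern the abstract describes: benchmarks, model adapters, and reporting as swappable components, with each run emitting a config hash so results can be traced back to exactly what was measured. None of the names below come from Nemo-Evaluator's actual API; they are placeholders for the pattern only.

```python
# Hypothetical sketch of a modular evaluation harness (NOT Nemo-Evaluator's API).
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Protocol


class Benchmark(Protocol):
    """Anything with a name and a score() method can plug into the runner."""
    name: str

    def score(self, generate: Callable[[str], str]) -> float: ...


@dataclass(frozen=True)
class ExactMatchBenchmark:
    """Toy benchmark: fraction of prompts answered with an exact match."""
    name: str
    cases: tuple[tuple[str, str], ...]  # (prompt, expected answer) pairs

    def score(self, generate: Callable[[str], str]) -> float:
        hits = sum(generate(prompt).strip() == expected
                   for prompt, expected in self.cases)
        return hits / len(self.cases)


def run_eval(model_id: str, generate: Callable[[str], str],
             benchmarks: list[Benchmark]) -> dict:
    """Evaluate one checkpoint and emit a self-describing report.

    The report embeds the run configuration and its hash, so any
    reported score can be traced back to what was measured.
    """
    config = {"model": model_id, "benchmarks": [b.name for b in benchmarks]}
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]
    scores = {b.name: b.score(generate) for b in benchmarks}
    return {"config": config, "config_hash": config_hash, "scores": scores}


if __name__ == "__main__":
    # Stand-in "model": any function from prompt to completion works,
    # which is what keeps benchmarks decoupled from model backends.
    def toy_model(prompt: str) -> str:
        return "4" if "2+2" in prompt else "Paris"

    bench = ExactMatchBenchmark(
        name="toy_qa",
        cases=(("2+2?", "4"), ("Capital of France?", "Paris")),
    )
    print(json.dumps(run_eval("checkpoint-001", toy_model, [bench]), indent=2))
```

Because the model is just a prompt-to-completion callable and benchmarks share one interface, the same runner can sweep many checkpoints across many benchmarks, which is the scaling problem the abstract highlights.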
Watch on YouTube ↗