The Science and Practice of Open and Scalable LLM Evaluations - Grzegorz Chlebus, NVIDIA
Rapid advances in AI have expanded the range of capabilities required for successful real-world deployment. Understanding where models stand on this multi-dimensional frontier is essential for accelerating innovation through effective quality assurance. Rigorous evaluation is increasingly difficult to scale, as development requires testing many checkpoints across numerous benchmarks. Model comparison is further complicated by the limited transparency of reported results. This talk explores challenges, best practices, and open-source tools that elevate evaluation to a core component of LLM development, delivering continuous signals across the model lifecycle.
We discuss principles for standardizing evaluation methods, practical patterns and anti-patterns that improve consistency, and examples of integrating the science of evaluation directly into model development. Using Nemo-Evaluator, an open-source, scalable evaluation tool, we demonstrate modular architectures that enable transparent, reproducible measurement. Finally, we show how Nemo-Evaluator supports reproducible evaluation for the Nemotron model family, helping enable one of the most open development processes in modern AI.
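To make the "modular, reproducible" idea concrete, here is a minimal, hypothetical Python sketch of the general pattern the abstract describes: benchmarks, model adapters, and reporting as swappable components, with each run emitting a config hash so results can be traced back to exactly what was measured. None of the names below come from Nemo-Evaluator's actual API; they are placeholders for the pattern only.

```python
# Hypothetical sketch of a modular evaluation harness (NOT Nemo-Evaluator's API).
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Protocol


class Benchmark(Protocol):
    """Anything with a name and a score() method can plug into the runner."""
    name: str

    def score(self, generate: Callable[[str], str]) -> float: ...


@dataclass(frozen=True)
class ExactMatchBenchmark:
    """Toy benchmark: fraction of prompts answered with an exact match."""
    name: str
    cases: tuple[tuple[str, str], ...]  # (prompt, expected answer) pairs

    def score(self, generate: Callable[[str], str]) -> float:
        hits = sum(generate(prompt).strip() == expected
                   for prompt, expected in self.cases)
        return hits / len(self.cases)


def run_eval(model_id: str, generate: Callable[[str], str],
             benchmarks: list[Benchmark]) -> dict:
    """Evaluate one checkpoint and emit a self-describing report.

    The report embeds the run configuration and its hash, so any
    reported score can be traced back to what was measured.
    """
    config = {"model": model_id, "benchmarks": [b.name for b in benchmarks]}
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]
    scores = {b.name: b.score(generate) for b in benchmarks}
    return {"config": config, "config_hash": config_hash, "scores": scores}


if __name__ == "__main__":
    # Stand-in "model": any function from prompt to completion works,
    # which is what keeps benchmarks decoupled from model backends.
    def toy_model(prompt: str) -> str:
        return "4" if "2+2" in prompt else "Paris"

    bench = ExactMatchBenchmark(
        name="toy_qa",
        cases=(("2+2?", "4"), ("Capital of France?", "Paris")),
    )
    print(json.dumps(run_eval("checkpoint-001", toy_model, [bench]), indent=2))
```

Because the model is just a prompt-to-completion callable and benchmarks share one interface, the same runner can sweep many checkpoints across many benchmarks, which is the scaling problem the abstract highlights.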
Watch on YouTube ↗