The Science and Practice of Open and Scalable LLM Evaluations - Grzegorz Chlebus, NVIDIA

PyTorch · Intermediate · 🏗️ Systems Design & Architecture · 2w ago
Rapid advances in AI have expanded the range of capabilities required for successful real-world deployment. Understanding where we are in this multi-dimensional frontier is essential for accelerating innovation through effective quality assurance. Rigorous evaluation is increasingly difficult to scale as development requires testing many checkpoints across numerous benchmarks. Model comparison is further complicated by limited transparency of reported results. This talk explores challenges, best practices, and open-source tools that elevate evaluation to a core component of LLM development, delivering continuous signals across the model lifecycle. We discuss principles for standardizing evaluation methods and improving consistency through practical patterns and anti-patterns, and examples of integrating the science of evaluation directly into model development. Using Nemo-Evaluator, an open-source scalable evaluation tool, we demonstrate modular architectures that enable transparent, reproducible measurement. Finally, we show how Nemo-Evaluator supports reproducible evaluation for the Nemotron model family, helping enable one of the most open development processes in modern AI.
