Datawizz - Benchmarking LLMs with LLM as Judge and Custom Metrics

Datawizz · Intermediate · 🧠 Large Language Models · 10mo ago
Curious about how different language models actually perform on your data, beyond just accuracy? In this video, we demo the new "LLM as Judge" feature on the Datawizz platform, designed to give you granular, multi-metric insights into model quality.

What You'll See:
- A walkthrough of how to set up model evaluations using multiple custom metrics (e.g., tone, truthfulness, brevity, comprehensiveness)
- Side-by-side benchmarking of OpenAI GPT-4, GPT-4o, and Together AI's Llama 3 70B, scored by a neutral third-party model (Anthropic Claude 3.5)
- How to define your own evaluation metrics and prompts fo…
Watch on YouTube ↗
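The video configures all of this through the Datawizz UI; for readers who want the underlying pattern, here is a minimal sketch of the LLM-as-judge loop using the Anthropic Python SDK. The four metric names come from the demo, but the prompt wording, the 1-5 scale, the judge_summary helper, and the judge model ID are illustrative assumptions, not Datawizz's actual API.

```python
# Minimal sketch of the LLM-as-judge pattern shown in the video.
# Assumes the Anthropic Python SDK (pip install anthropic); the prompt,
# scoring scale, and helper name are illustrative, not Datawizz's config.
import json
import anthropic

# Custom metrics from the video's conversation-summary demo
METRICS = ["comprehensiveness", "truthfulness", "brevity", "tone"]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def judge_summary(source_text: str, candidate_summary: str) -> dict:
    """Ask a neutral judge model to score one candidate output per metric (1-5)."""
    prompt = (
        "You are an impartial evaluator. Score the summary below against the "
        f"source conversation on these metrics: {', '.join(METRICS)}. "
        "Reply with JSON only, mapping each metric name to an integer from 1 to 5.\n\n"
        f"Conversation:\n{source_text}\n\nSummary:\n{candidate_summary}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # the neutral third-party judge
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)


# Usage: run the same input through each candidate model (GPT-4, GPT-4o,
# Llama 3 70B), judge every output, then compare the per-metric scores.
scores = judge_summary("...conversation transcript...", "...model summary...")
print(scores)  # e.g. {"comprehensiveness": 4, "truthfulness": 5, ...}
```

One judge call per candidate output per model is what drives the call-count and cost considerations discussed in the 4:41 chapter.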

Chapters (9)

0:00 Intro: Why multi-metric evaluation?
0:37 Demo setup: Conversation summary task
1:07 Loading sample logs & evaluation config
1:42 Choosing models and judge
2:13 Setting custom evaluation metrics
3:20 Metric definitions: comprehensiveness, truthfulness, brevity, tone
4:41 How the evaluation runs (calls & cost considerations)
5:26 Live results, radar charts, and insights
5:56 Key findings and next steps
Next Up
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)