Datawizz - Benchmarking LLMs with LLM as Judge and Custom Metrics

Datawizz · Intermediate · 🧠 Large Language Models · 10mo ago
Curious about how different language models actually perform on your data, beyond just accuracy? In this video, we demo the new "LLM as Judge" feature on the Datawizz platform, designed to give you granular, multi-metric insights into model quality.

What You'll See:
- A walkthrough of how to set up model evaluations using multiple custom metrics (e.g., tone, truthfulness, brevity, comprehensiveness)
- Side-by-side benchmarking of OpenAI GPT-4, GPT-4o, and Together AI's Llama 3 70B, scored by a neutral third-party model (Anthropic Claude 3.5)
- How to define your own evaluation metrics and prompts fo…
Watch on YouTube ↗
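The video configures all of this through the Datawizz UI; for readers who want the underlying pattern, here is a minimal sketch of the LLM-as-judge loop using the Anthropic Python SDK. The four metric names come from the demo, but the prompt wording, the 1-5 scale, the judge_summary helper, and the judge model ID are illustrative assumptions, not Datawizz's actual API.

```python
# Minimal sketch of the LLM-as-judge pattern shown in the video.
# Assumes the Anthropic Python SDK (pip install anthropic); the prompt,
# scoring scale, and helper name are illustrative, not Datawizz's config.
import json
import anthropic

# Custom metrics from the video's conversation-summary demo
METRICS = ["comprehensiveness", "truthfulness", "brevity", "tone"]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def judge_summary(source_text: str, candidate_summary: str) -> dict:
    """Ask a neutral judge model to score one candidate output per metric (1-5)."""
    prompt = (
        "You are an impartial evaluator. Score the summary below against the "
        f"source conversation on these metrics: {', '.join(METRICS)}. "
        "Reply with JSON only, mapping each metric name to an integer from 1 to 5.\n\n"
        f"Conversation:\n{source_text}\n\nSummary:\n{candidate_summary}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # the neutral third-party judge
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)


# Usage: run the same input through each candidate model (GPT-4, GPT-4o,
# Llama 3 70B), judge every output, then compare the per-metric scores.
scores = judge_summary("...conversation transcript...", "...model summary...")
print(scores)  # e.g. {"comprehensiveness": 4, "truthfulness": 5, ...}
```

One judge call per candidate output per model is what drives the call-count and cost considerations discussed in the 4:41 chapter.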

Chapters (9)

0:00 Intro: Why multi-metric evaluation?
0:37 Demo setup: Conversation summary task
1:07 Loading sample logs & evaluation config
1:42 Choosing models and judge
2:13 Setting custom evaluation metrics
3:20 Metric definitions: comprehensiveness, truthfulness, brevity, tone
4:41 How the evaluation runs (calls & cost considerations)
5:26 Live results, radar charts, and insights
5:56 Key findings and next steps
Next Up
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)