Rubric-Based LLM-as-Judge: Consistent Eval Scores in Python

Professor Py: AI Engineering · Beginner · 🛠️ AI Tools & Apps · 4mo ago

Skills: LLM Engineering (90%)

Key Takeaways

This video demonstrates a rubric-based LLM evaluation pipeline in Python that scores model answers on criteria such as coverage, brevity, and instruction-following, giving a reproducible workflow for model selection and automated feedback. The pipeline uses compact Python code to score answers, weight signals, compare candidates, anchor scores for stability, and run mini-batch comparisons.

Full Transcript

Bad rubrics make good evals. This is the workflow you'll know how to run when we're done, which scores answers by rubric criteria and weights them into a structured, comparable score. I'm Professor Py, teaching AI engineering and LLM systems with simple Python.
Evaluating model answers feels vague until you make it mechanical. People reach for a rubric because they need consistent, repeatable judgments. That matters for model selection, training feedback, and quality gates when you push models into production. With a rubric, you turn subjective impressions into numbers you can track, compare, and improve.
At a high level, a rubric breaks a response into signals. Coverage checks required facts. Brevity measures length. Follow checks whether instructions were obeyed. Combine those signals with weights and you get a single structured score that reflects your priorities. The strengths are clarity and reproducibility. The weakness is overfitting to the rubric if you bake in the wrong priorities.
We will build a compact pipeline that scores one answer, then weights signals, compares candidates, anchors scores for stability, and finally runs mini-batch evaluations. It uses tiny explicit rubrics so every decision is traceable and easy to tune.
Let us start by scoring a single answer with a tiny deterministic rubric. The variable keywords lists three must-have terms derived from the prompt. The function judge basic computes coverage as the fraction of those terms present in the response. It also computes a simple concision check based on character length, encouraging brief answers. These two signals are averaged with equal weights to form a base score, then rounded to two decimals for clean tracking.
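The on-screen code is not part of this transcript, so here is a minimal sketch of the single-answer scorer as described. The name judge_basic is an assumed spelling of the spoken "judge basic", and the keyword list, the 120-character budget, and the sample answer are hypothetical.

```python
# Hypothetical must-have terms; the video's actual keywords are not in the transcript.
KEYWORDS = ["cache", "header", "expiry"]
MAX_CHARS = 120  # assumed character budget for the concision check


def judge_basic(response: str) -> float:
    """Average keyword coverage and a length check into one base score."""
    text = response.lower()
    # Coverage: fraction of must-have terms that appear in the response.
    coverage = sum(term in text for term in KEYWORDS) / len(KEYWORDS)
    # Concision: pass/fail on character length (a deliberately crude signal).
    concision = 1.0 if len(response) <= MAX_CHARS else 0.0
    # Equal weights, rounded to two decimals for clean tracking.
    return round(0.5 * coverage + 0.5 * concision, 2)


answer = "A cache stores responses; a header sets the expiry."
base_score = judge_basic(answer)
print("base score:", base_score)
```

Because the function has no randomness and no model call, the same answer always yields the same score, which is what makes the number trackable across runs.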
Shortening the response increases concision but risks missing required terms, while longer text can reduce the concision signal. The printed base score shows a reproducible numeric judgment for one answer.
Next, we add explicit weights so the rubric matches priorities like precision or brevity. This example introduces an explicit weighted rubric and returns structured scores. The dictionary rubric defines weights for coverage, brevity, and follow, letting you control their influence. The function judge rubric computes coverage like before, then a brevity signal from word count and a follow signal based on sentence count to reflect instruction following. It multiplies each signal by its weight, sums them, and returns a dictionary containing each component and the total, all rounded for stable logging. Increasing the coverage weight magnifies keyword presence. Decreasing it shifts emphasis to brevity or following directions. The printed weighted total confirms a combined score driven by explicit tunable criteria.
Now we compare two answers head-to-head to pick the better model. This code compares two candidate answers using the structured rubric and picks a winner deterministically. The variable cans maps model labels to their responses. The dictionary comprehension scores each candidate with judge rubric, extracting the total for a clean numeric comparison. The sorted call ranks by descending score and breaks ties by label name to ensure stable ordering across runs. This approach makes head-to-head comparisons reproducible and debuggable because scores are derived from transparent signals. If you tighten the brevity threshold in judge rubric, longer candidates will drop in rank. The printed winner identifies which candidate best meets the current rubric.
Next, we anchor scores against fixed examples so comparisons stay stable as your models drift.
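The weighted rubric and the head-to-head comparison described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the video's code: judge_rubric and cands are assumed spellings of the spoken names, and the weights, the 20-word and 2-sentence thresholds, and the candidate texts are hypothetical.

```python
import re

# Assumed weights; the transcript does not give the actual values.
RUBRIC = {"coverage": 0.5, "brevity": 0.3, "follow": 0.2}
KEYWORDS = ["cache", "header", "expiry"]  # hypothetical must-have terms
MAX_WORDS = 20       # assumed brevity threshold
MAX_SENTENCES = 2    # assumed stand-in for "answer in at most two sentences"


def judge_rubric(response, keywords=KEYWORDS, rubric=RUBRIC):
    """Return each rounded signal plus the weighted total."""
    text = response.lower()
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    signals = {
        "coverage": sum(term in text for term in keywords) / len(keywords),
        "brevity": 1.0 if len(response.split()) <= MAX_WORDS else 0.0,
        "follow": 1.0 if len(sentences) <= MAX_SENTENCES else 0.0,
    }
    total = sum(rubric[name] * value for name, value in signals.items())
    result = {name: round(value, 2) for name, value in signals.items()}
    result["total"] = round(total, 2)
    return result


# Hypothetical candidate answers keyed by model label.
cands = {
    "model_a": "A cache stores responses. A header sets the expiry.",
    "model_b": "Caching is a broad topic. There are many kinds. It depends on the server.",
}
scores = {label: judge_rubric(text)["total"] for label, text in cands.items()}
# Rank by descending score, then by label so ties resolve the same way every run.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
winner = ranked[0][0]
print(scores, "winner:", winner)
```

Note that coverage here is plain substring matching, so "caching" does not count as "cache". That is the kind of rubric detail worth checking before trusting the ranking.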
This snippet normalizes rubric scores against weak and strong anchors to stabilize comparisons over time. The variables anchor weak and anchor strong serve as fixed reference answers that set the lower and upper bounds. Using judge rubric, the code measures a low and a high. Then the function calibrate maps any raw score to a zero-to-one scale with clipping for safety. The variable SA captures the uncalibrated total for a candidate, and the printed value reports its anchored score. Raising anchor strong quality increases the spread required for high calibrated scores. Lowering anchor weak compresses low-end differences. The printed calibrated A shows how one answer fares on a consistent anchored scale.
Finally, we run a small batch of prompts to pick the best model across tasks. This example evaluates multiple prompts in a batch and reports the best overall model. The list cases holds per-prompt keyword rubrics and each model's response, keeping inputs explicit and reproducible. For each case, the function judge rubric computes a total score by the same weights used earlier. The variables SA and SB sum totals across the batch, effectively representing unnormalized averages since each case counts once. The comparison includes a tie-break on model label to keep results stable when scores match. If you add more cases, the aggregated decision becomes less sensitive to any single prompt. The printed run best announces the top performer for this mini evaluation run.
This toy pipeline maps cleanly into real projects. Replace the tiny keyword lists with domain checklists. Use anchor answers from expert raters. Run the batch evaluation over hundreds of prompts to stabilize selection. These small moves take the example from lab demo to a repeatable model selection process.
Recap. It is a rubric-based LLM judge that turns criteria into numeric signals.
Use it when you need reproducible comparisons, model selection, or automated feedback at scale. Caveat: a bad rubric gives precise but wrong answers, so validate your signals. Next step: tune the coverage and brevity weights across a larger set of cases to see how model rankings change. If short practical AI engineering helps, subscribe and watch the AI engineering
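The anchoring and mini-batch steps from the transcript can be sketched as follows. This is a hedged reconstruction, not the video's code: anchor_weak, anchor_strong, calibrate, cases, and run_best are assumed spellings of the spoken names, and the weights, thresholds, anchor texts, and cases are hypothetical. The scorer is restated in compact form so the block runs on its own.

```python
RUBRIC = {"coverage": 0.5, "brevity": 0.3, "follow": 0.2}  # assumed weights


def judge_rubric(response, keywords, max_words=20, max_sentences=2):
    """Compact weighted scorer returning only the rounded total."""
    text = response.lower()
    n_sentences = sum(1 for s in response.split(".") if s.strip())
    signals = {
        "coverage": sum(term in text for term in keywords) / len(keywords),
        "brevity": float(len(response.split()) <= max_words),
        "follow": float(n_sentences <= max_sentences),
    }
    return round(sum(RUBRIC[name] * value for name, value in signals.items()), 2)


keywords = ["cache", "header", "expiry"]  # hypothetical must-have terms
anchor_weak = "It depends. Many things matter. Servers vary."
anchor_strong = "A cache stores responses. A header sets the expiry."
low = judge_rubric(anchor_weak, keywords)
high = judge_rubric(anchor_strong, keywords)


def calibrate(raw, low=low, high=high):
    """Map a raw total onto the anchor range, clipped to [0, 1]."""
    if high <= low:
        raise ValueError("strong anchor must score above weak anchor")
    return round(min(1.0, max(0.0, (raw - low) / (high - low))), 2)


sa = judge_rubric("A cache keeps a copy of each response.", keywords)
calibrated_a = calibrate(sa)
print("raw:", sa, "calibrated:", calibrated_a)

# Mini-batch: each case carries its own keywords and one response per model.
cases = [
    {"keywords": keywords,
     "model_a": anchor_strong,
     "model_b": "A cache keeps a copy of each response."},
    {"keywords": ["index", "query"],
     "model_a": "It speeds things up.",
     "model_b": "An index lets a query skip full scans."},
]
totals = {
    label: sum(judge_rubric(case[label], case["keywords"]) for case in cases)
    for label in ("model_a", "model_b")
}
# Highest batch total wins; the label breaks ties deterministically.
run_best = min(totals, key=lambda label: (-totals[label], label))
print(totals, "run best:", run_best)
```

The anchors are scored once and held fixed, so a calibrated value only moves when the candidate's own raw score moves. Swapping in better anchor answers changes the scale for every candidate at once.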

Original Description

Rubric-based LLM evaluation: learn a compact Python pipeline to score, weight, and compare model answers deterministically. Get a reproducible workflow that turns coverage, brevity, and instruction-following into numeric signals, anchors scores for stability, and runs mini-batch comparisons for reliable model selection. Includes tiny deterministic rubrics and Python code you can adapt for anchors, weights, and larger-scale evaluations. Subscribe for practical AI engineering and LLM systems tutorials from Professor Py. #LLMEvaluation #Rubrics #ModelSelection #Python #AIEngineering #MLOps #PromptEvaluation

This video teaches how to build a rubric-based LLM evaluation pipeline in Python that provides a reproducible workflow for model selection and automated feedback. The pipeline scores model answers on criteria such as coverage, brevity, and instruction-following, and uses a weighted rubric and anchored scores for stable comparisons. The video demonstrates how to implement this pipeline in compact Python code.

Key Takeaways

Define a rubric with criteria such as coverage, brevity, and instruction-following
Implement a weighted rubric using a dictionary
Score model answers using the weighted rubric
Anchor scores for stability using fixed reference answers
Run mini-batch comparisons to select the best model

💡 A bad rubric can give precise but wrong answers, so it's essential to validate the signals and tune the weights across a larger set of cases
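One way to act on that tip is to sweep a weight and watch whether the ranking flips. The sketch below assumes the per-case coverage and brevity signals have already been measured; the numbers, model labels, and the two-signal simplification are all hypothetical.

```python
# Hypothetical (coverage, brevity) signals per case for two models.
signals = {
    "model_a": [(1.0, 0.0), (0.67, 0.0), (1.0, 1.0)],   # thorough but long
    "model_b": [(0.33, 1.0), (0.67, 1.0), (0.5, 1.0)],  # brief but less complete
}


def batch_total(rows, w_coverage):
    """Sum weighted scores over all cases; brevity takes the remaining weight."""
    w_brevity = 1.0 - w_coverage
    return round(sum(w_coverage * c + w_brevity * b for c, b in rows), 2)


def rank(w_coverage):
    """Order models by descending batch total, breaking ties by label."""
    totals = {m: batch_total(rows, w_coverage) for m, rows in signals.items()}
    return sorted(totals, key=lambda m: (-totals[m], m))


for w in (0.2, 0.5, 0.8):
    print(f"coverage weight {w}: {rank(w)}")
```

With these made-up signals the brief model leads at low coverage weights and the thorough model takes over at 0.8. A ranking that flips inside the plausible weight range is a sign the choice of weights needs validating against human judgments before it drives model selection.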
