Test LLM Eval Wins in Python: Bootstrap Confidence Intervals
About this lesson
Is that prompt win real? Learn session-aware bootstrap confidence intervals to tell whether a prompt’s eval margin is stable or just noise. Build a compact Python pipeline that turns paired judgments into CI-backed margins, a ship-or-hold decision, and an explicit regression risk estimate. Code examples use Python and NumPy (np, np.random.default_rng) for reproducible resampling and session-block bootstrapping. Subscribe for concise AI engineering and LLM evaluation tutorials. #MachineLearning #AIEngineering #LLMs #Python #NumPy #ModelEvaluation #Bootstrap
Full Transcript
A prompt can win your evals, but is that win real? You'll understand the pattern behind a pipeline that runs bootstrap comparisons and reports when a prompt win is significant. I'm Professor Py, teaching AI engineering and LLM systems with simple Python.

Evaluations for prompts and model tweaks are full of small numbers and big feelings. You run 20 or 50 paired judgments and one prompt nudges ahead. Managers want a yes or no. Engineers want a number plus an honest uncertainty band. Bootstrap confidence intervals give you that. They turn a single table into a distribution of plausible outcomes, so you can see whether a win is stable or just noise.

Intuitively, bootstrap resampling pretends you could rerun the same evaluation many times by drawing from the existing judgments with replacement. Each resample yields a new value for your metric. The spread of those values becomes a confidence interval that tells you how much the observed margin might wiggle if you ran the study again under similar conditions.

We will build a small pipeline that starts with raw paired judgments, computes a baseline margin, bootstraps a confidence interval, simulates an alternative prompt, respects session-level correlations, and finally gives a ship-or-hold signal with an explicit regression risk. The code is compact and meant to be easy to read. Quick note: this code is designed to teach the core idea clearly. Library versions and local setups can vary, so check the official docs if your setup behaves differently.

First, we compute the baseline margin from the raw eval table, so we know what number we must explain. This snippet measures the baseline win margin from raw eval outcomes. Engineers worry that a small eval table can make a 55% win rate look definitive.
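The on-screen code is not part of this transcript, so the following is only a minimal sketch of the baseline-margin step. The 20 judgments are hypothetical illustrative data, and the names eval_outcomes, win_rate, loss_rate, and margin are assumed from the spoken description.

```python
import numpy as np

# Hypothetical paired judgments (illustrative data, not from the lesson):
# 1 = new prompt wins, 0 = tie, -1 = new prompt loses.
eval_outcomes = np.array([1, 1, 0, -1, 1, 0, 1, -1, 1, 1,
                          0, -1, 1, 0, 1, 1, -1, 0, 1, -1])

# The mean of a boolean comparison is the fraction of True entries.
win_rate = np.mean(eval_outcomes == 1)
loss_rate = np.mean(eval_outcomes == -1)

# Net margin: how far the candidate leads once losses are subtracted.
margin = win_rate - loss_rate

print(f"baseline margin: {margin:+.2f} (win {win_rate:.2f}, loss {loss_rate:.2f})")
```

Because ties are encoded as zero, this margin equals the plain mean of the array, which is what lets the later bootstrap steps use a single mean per resample.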
So I start by encoding each paired judgment in eval outcomes as one for a new prompt win, zero for a tie, and negative one for a loss. The NumPy mean on boolean comparisons gives win rate and loss rate without loops, keeping the math reproducible on any rerun. Subtracting loss rate from win rate reveals margin, the figure you actually care about when shipping. The printed baseline margin shows exactly how far the candidate leads today.

Next, we need to know how noisy that margin is. So we resample the judgments many times to build a confidence band. This example bootstraps random draws to build a confidence interval for that margin. With rng = np.random.default_rng(42), I lock the resampling sequence so coworkers can reproduce every run. rng.choice resamples the 20 eval outcomes with replacement nboot times, yielding samples whose rows mimic new evals drawn from the same pool. Because the list uses one, zero, and negative one encodings, np.mean on each row returns the net win margin directly. And np.percentile summarizes those margins into the familiar 95% band. Increasing nboot reduces Monte Carlo noise, but costs compute. Decreasing it makes the interval jumpier. The printed bootstrap margin CI shows how uncertain the baseline margin really is.

With a CI in hand, we simulate a concrete alternate prompt by adjusting a few judgments, then compare the bootstrapped margins. This code contrasts two prompt variants by ranking their bootstrapped margins. To model a RAG tweak that fixes some tie cases, alt outcomes copies eval outcomes and flips three zero judgments to wins before drawing new bootstrap samples. np.mean again converts each alt samples row into a margin. And max over a tiny list of tuples keeps only the prompt whose average bootstrap margin is largest.
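A hedged sketch of the row-level bootstrap and the two-variant comparison described here. The data, the value of n_boot, the helper bootstrap_margins, and the labels "prompt_v2" and "prompt_v2_rag_fix" are all assumptions for illustration, as is the choice of which three ties to flip.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so reruns match
n_boot = 5000                    # assumed resample count

# Hypothetical judgments: 1 = win, 0 = tie, -1 = loss.
eval_outcomes = np.array([1, 1, 0, -1, 1, 0, 1, -1, 1, 1,
                          0, -1, 1, 0, 1, 1, -1, 0, 1, -1])

def bootstrap_margins(outcomes, rng, n_boot):
    """Resample judgments with replacement; one margin per resample."""
    samples = rng.choice(outcomes, size=(n_boot, outcomes.size), replace=True)
    return samples.mean(axis=1)  # mean of 1/0/-1 is the net win margin

base_margins = bootstrap_margins(eval_outcomes, rng, n_boot)
ci_low, ci_high = np.percentile(base_margins, [2.5, 97.5])
print(f"bootstrap margin CI: [{ci_low:+.2f}, {ci_high:+.2f}]")

# Simulated alternate prompt: flip the first three ties to wins.
alt_outcomes = eval_outcomes.copy()
tie_idx = np.flatnonzero(alt_outcomes == 0)[:3]
alt_outcomes[tie_idx] = 1
alt_margins = bootstrap_margins(alt_outcomes, rng, n_boot)

# Keep the variant with the larger average bootstrap margin.
best_label, best_margin = max(
    [("prompt_v2", base_margins.mean()),
     ("prompt_v2_rag_fix", alt_margins.mean())],
    key=lambda pair: pair[1],
)
print(f"top prompt: {best_label} (mean margin {best_margin:+.2f})")
```

Ranking by mean bootstrap margin only says which variant leads on average; it does not by itself say the lead is significant, which is why the interval matters.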
Best label and best margin make the comparison explicit for later steps without touching disk or spreadsheets. The printed top prompt line shows which variant currently leads and by how much under this simulation.

But plain resampling treats every judgment as independent, which is wrong when the same user produced several rows. We fix that by resampling session blocks instead of individual judgments. This snippet reruns the bootstrap at the session level to respect correlated evaluations. If best label earned those wins by dominating a handful of test conversations, sampling individual rows would exaggerate stability. The small if block selects the right outcome array, then reshape(5, 4) groups the data into five user sessions of four questions each. rng.choice now resamples session-block row indices, so every draw keeps conversational structure intact before np.mean computes a margin per replicate. np.percentile converts the session-aware distribution into a clean interval. The printed session-aware margin CI shows what margin range survives when you control for repeated users.

Finally, we turn the interval into a concrete ship-or-hold rule and quantify regression risk. This example turns the interval into a ship-or-hold decision with an explicit regression risk. Baseline margin est tracks the earlier prompt V2 bootstrap center, letting us judge the new variant against a familiar benchmark. The decision expression only ships when the lower end of session CI clears that baseline, while regression risk counts how often the session-level bootstrap falls back to baseline margin est or worse. Both metrics reuse the same block margins array, so every teammate applying this rule sees identical numbers. The printed ship decision and regression risk expose the go/no-go call and how much downside remains if you ship.
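The session-level bootstrap and the decision rule might look roughly like the sketch below. It assumes the hypothetical 20 judgments are already ordered by session (five sessions of four questions), that the alternate variant is the one being judged, and that baseline_margin_est is the mean of a row-level bootstrap of the baseline. All names follow the spoken description rather than confirmed source code.

```python
import numpy as np

rng = np.random.default_rng(42)
n_boot = 5000  # assumed resample count

# Hypothetical judgments: 1 = win, 0 = tie, -1 = loss.
eval_outcomes = np.array([1, 1, 0, -1, 1, 0, 1, -1, 1, 1,
                          0, -1, 1, 0, 1, 1, -1, 0, 1, -1])
alt_outcomes = eval_outcomes.copy()
alt_outcomes[np.flatnonzero(alt_outcomes == 0)[:3]] = 1  # three ties become wins

# Benchmark: center of the row-level bootstrap for the baseline prompt.
base_samples = rng.choice(eval_outcomes, size=(n_boot, eval_outcomes.size), replace=True)
baseline_margin_est = base_samples.mean(axis=1).mean()

# Group rows into sessions (assumes rows are ordered by session).
session_blocks = alt_outcomes.reshape(5, 4)
n_sessions = session_blocks.shape[0]

# Resample whole sessions, not single judgments, so correlated rows stay together.
block_idx = rng.choice(n_sessions, size=(n_boot, n_sessions), replace=True)
block_margins = session_blocks[block_idx].mean(axis=(1, 2))

session_ci = np.percentile(block_margins, [2.5, 97.5])

# Ship only if even the pessimistic end of the interval beats the benchmark.
ship_decision = "ship" if session_ci[0] > baseline_margin_est else "hold"
# Share of session-level replicates that land at or below the benchmark.
regression_risk = np.mean(block_margins <= baseline_margin_est)

print(f"session-aware margin CI: [{session_ci[0]:+.2f}, {session_ci[1]:+.2f}]")
print(f"ship decision: {ship_decision}, regression risk: {regression_risk:.3f}")
```

In this made-up data one session carries most of the losses, so resampling sessions widens the interval compared with resampling rows, and the rule holds rather than ships. That is the exaggerated stability the session-aware bootstrap is meant to expose.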
In a real system, you map eval outcomes to the set of held-out conversations you use in pre-launch tests. And you treat session blocks as actual users or conversation IDs, so the CI matches production traffic. What this gives you is a clear signal instead of a gut call. Use session-aware bootstrap CIs whenever your evals contain repeated users or clustered examples. Also, remember that a trustworthy interval needs enough base judgments, and a stable one needs enough resamples. Otherwise, the confidence looks stronger than it really is.

To push this further, increase nboot and rerun the session-level bootstrap until the Monte Carlo noise is small relative to your decision threshold. Then re-evaluate ship decision and regression risk. If practical AI engineering helps, subscribe and watch the AI engineering videos.
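One way to check that suggestion, sketched under the same hypothetical data: rerun the session-level bootstrap with several seeds at each resample count and look at how much the lower CI bound moves. The helper session_ci_bounds, the seed count, and the n_boot grid are assumptions chosen for illustration.

```python
import numpy as np

def session_ci_bounds(session_blocks, n_boot, seed):
    """Session-block bootstrap; returns the 2.5th and 97.5th percentile margins."""
    rng = np.random.default_rng(seed)
    n_sessions = session_blocks.shape[0]
    idx = rng.choice(n_sessions, size=(n_boot, n_sessions), replace=True)
    margins = session_blocks[idx].mean(axis=(1, 2))
    return np.percentile(margins, [2.5, 97.5])

# Hypothetical alternate-prompt judgments, 5 sessions x 4 questions.
session_blocks = np.array([1, 1, 1, -1, 1, 1, 1, -1, 1, 1,
                           1, -1, 1, 0, 1, 1, -1, 0, 1, -1]).reshape(5, 4)

spreads = {}
for n_boot in (200, 2000, 20000):
    # Same data, different seeds: any variation here is pure Monte Carlo noise.
    lows = [session_ci_bounds(session_blocks, n_boot, seed)[0] for seed in range(10)]
    spreads[n_boot] = float(np.ptp(lows))  # max minus min of the lower bound
    print(f"n_boot={n_boot:>6}: lower-bound spread across seeds = {spreads[n_boot]:.3f}")
```

If the spread is still comparable to the gap between the lower bound and the benchmark, the ship decision could flip between reruns, which suggests raising nboot. More resamples only steady the estimate; they do not add information beyond the original sessions.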