📰 Dev.to · Maya Andersson

We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"

Dev.to · Maya Andersson 🧠 Large Language Models ⚡ AI Lesson 6d ago

We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of...

More eval traces will not stabilize your kappa. Stratify the ones you have

Dev.to · Maya Andersson 1w ago

TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63...