📰 Dev.to · Maya Andersson

6d ago
We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"
We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of...

1w ago
More eval traces will not stabilize your kappa. Stratify the ones you have
TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63...