Your eval criteria are code. Version them like code.

📰 Medium · Machine Learning

A judge prompt is an implementation. The criterion it encodes is a contract, and ours rotted for three months before anyone noticed. Continue reading on Medium »

Published 9 Jun 2026
Read full article → ← Back to Reads