A Postmortem on Autonomous LLM-as-Judge: How My Eval Agent Got Two Verdicts Wrong Before I Found a Sandbox Bug
📰 Dev.to · 陈瀚
I run an autonomous eval agent against new coding-agent stacks before trusting their numbers. The...