On Randomness in Agentic Evals

📰 ArXiv cs.AI

Evaluations of agentic systems may be unreliable because single-run performance estimates show substantial variance from one run to the next

Advanced | Published 26 Mar 2026
Action Steps
  1. Collect a large number of agentic trajectories by repeating the evaluation, so that performance variance can be estimated
  2. Measure how much single-run pass@1 estimates vary across runs to judge how far a single run can be trusted
  3. Consider reporting results over multiple runs, or using alternative evaluation metrics, to improve reliability
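The action steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: the function names `pass_at_1_per_run` and `summarize_runs` are hypothetical, the outcome matrix is made-up toy data, and the standard error shown here only captures run-to-run spread under the assumption that runs are independent.

```python
import statistics


def pass_at_1_per_run(outcomes):
    """Return one pass@1 value per run.

    outcomes[r][t] is True if run r solved task t (hypothetical layout).
    """
    return [sum(run) / len(run) for run in outcomes]


def summarize_runs(outcomes):
    """Summarize how much single-run pass@1 varies across repeated runs."""
    rates = pass_at_1_per_run(outcomes)
    # Sample standard deviation across runs; undefined for a single run.
    spread = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return {
        "per_run": rates,
        "mean": statistics.mean(rates),
        "stdev": spread,
        # Standard error of the mean over runs, assuming independent runs.
        "stderr": spread / len(rates) ** 0.5,
        "range": (min(rates), max(rates)),
    }


# Made-up toy data for illustration only: 3 runs over the same 4 tasks.
toy_outcomes = [
    [True, False, True, True],
    [True, False, False, True],
    [False, False, True, True],
]

summary = summarize_runs(toy_outcomes)
print(summary)
```

If the range of per-run values is wide compared with the difference between two systems being compared, a single run is unlikely to separate them, and reporting the mean with its spread over several runs is the safer choice.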
Who Needs to Know This

AI researchers and engineers working on agentic systems benefit from understanding the limitations of current evaluation methods, because those limitations affect how reliably progress toward more robust models can be measured

Key Insight

💡 For agentic systems, a single-run performance estimate may be unreliable because results vary substantially between runs
