On Randomness in Agentic Evals

📰 ArXiv cs.AI

Evaluations of agentic systems may be unreliable because single-run performance estimates show substantial variance

Advanced | Published 26 Mar 2026

Action Steps

Collect a large number of agentic trajectories to estimate performance variance
Analyze the variance in single-run pass@1 estimates to determine reliability
Consider using multiple runs or alternative evaluation metrics to improve reliability
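
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's procedure: the function name `summarize_runs`, the layout of `outcomes` (one list of pass/fail booleans per run), and the toy data are all assumptions made for the example.

```python
import math
import statistics


def summarize_runs(outcomes):
    """Summarize run-to-run variation in pass@1.

    outcomes[r][t] is True if task t passed in run r (hypothetical layout).
    """
    if len(outcomes) < 2:
        raise ValueError("need at least two runs to estimate variance")
    # pass@1 for each run: the fraction of tasks solved in that run
    per_run = [sum(run) / len(run) for run in outcomes]
    mean = statistics.mean(per_run)
    # sample standard deviation across runs (n - 1 in the denominator)
    std = statistics.stdev(per_run)
    # standard error of the mean pass@1 over the collected runs
    sem = std / math.sqrt(len(per_run))
    return {
        "per_run": per_run,
        "mean": mean,
        "std": std,
        "sem": sem,
        "spread": max(per_run) - min(per_run),
    }


# Toy data, invented for illustration only: 3 runs over the same 4 tasks.
toy_outcomes = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
summary = summarize_runs(toy_outcomes)
print(summary)
```

If `spread` or `std` is large relative to the difference between two systems being compared, a single run is unlikely to separate them. Reporting the mean together with `sem` over several runs is one simple alternative.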

Who Needs to Know This

AI researchers and engineers working on agentic systems: understanding the limits of current evaluation methods matters because unreliable evaluations can misdirect the development of more robust and reliable models

Key Insight

💡 A single-run pass@1 estimate for an agentic system can carry substantial variance, so one run alone may not be a reliable measure of performance