AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

📰 arXiv cs.AI

Learn to identify and address evaluator-channel ranking instability in agent repair using AuditRepairBench, a corpus of paired-execution traces.

advanced Published 7 May 2026

Action Steps

Collect paired-execution traces using AuditRepairBench to analyze evaluator-channel ranking instability
Apply ranking instability metrics to the collected traces to identify failure modes
Configure agent repair methods to consult evaluator-derived signals during internal selection of candidate repairs
Evaluate the robustness of agent repair methods against the AuditRepairBench corpus
Compare the performance of different agent repair methods using the provided leaderboard
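
As a rough illustration of the second step, the sketch below computes one possible instability measure: the fraction of method pairs whose relative order flips between the two halves of a paired execution. Everything in it is an assumption made for illustration. The score dictionaries, the method names, and the `pairwise_flip_rate` helper are hypothetical and do not reflect AuditRepairBench's actual trace format or the metrics it defines, and this summary does not specify what differs between the two executions in a pair.

```python
from itertools import combinations


def ranking(scores):
    """Return method names ordered best-first by score (ties broken by name)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def pairwise_flip_rate(scores_a, scores_b):
    """Fraction of method pairs whose relative order differs between two score sets.

    Pairs that are tied in either score set are not counted as flips.
    """
    methods = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(methods, 2))
    if not pairs:
        return 0.0
    flips = 0
    for m1, m2 in pairs:
        diff_a = scores_a[m1] - scores_a[m2]
        diff_b = scores_b[m1] - scores_b[m2]
        if diff_a * diff_b < 0:  # opposite signs: the pair swapped order
            flips += 1
    return flips / len(pairs)


# Hypothetical toy scores for three repair methods under the two
# executions of one pair. These numbers are invented, not benchmark results.
execution_1 = {"method_a": 0.72, "method_b": 0.65, "method_c": 0.58}
execution_2 = {"method_a": 0.60, "method_b": 0.66, "method_c": 0.55}

print(ranking(execution_1))  # ['method_a', 'method_b', 'method_c']
print(ranking(execution_2))  # ['method_b', 'method_a', 'method_c']
print(round(pairwise_flip_rate(execution_1, execution_2), 3))  # 0.333
```

A flip rate of 0 means both executions rank the methods identically, while values closer to 1 mean the ranking is mostly reversed. A pairwise measure was chosen here only because it is simple to read; a rank correlation such as Kendall's tau would carry the same information.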

Who Needs to Know This

AI researchers and engineers working on agent repair, particularly those concerned with evaluator-channel ranking instability, can use this corpus to test and validate their methods.

Key Insight

💡 Evaluator-channel ranking instability can be operationalized and measured with paired-execution traces, which supports the development of more robust agent repair methods.