Detecting Safety Violations Across Many Agent Traces
📰 ArXiv cs.AI
arXiv:2604.11806v1
Abstract: To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges