A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
📰 ArXiv cs.AI
arXiv:2512.20798v4 Announce Type: replace Abstract: As autonomous AI agents are deployed in high-stakes environments, ensuring their safety has become a paramount concern. Existing safety benchmarks primarily evaluate whether agents refuse explicitly harmful instructions or maintain procedural compliance, but few capture emergent outcome-driven constraint violations: failures that arise when agents, optimizing for goals under performance pressure, deprioritize ethical, legal, or safety constraints.