A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
📰 ArXiv cs.AI
arXiv:2512.20798v4 Announce Type: replace Abstract: As autonomous AI agents are deployed in high-stakes environments, ensuring their safety has become a paramount concern. Existing safety benchmarks primarily evaluate whether agents refuse explicitly harmful instructions or maintain procedural compliance, but few capture emergent outcome-driven constraint violations: failures that arise when agents, optimizing for goals under performance pressure, deprioritize ethical, legal, or safety constraints.