SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents
📰 ArXiv cs.AI
arXiv:2606.02302v1 (announce type: cross)
Abstract: Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather tha