CocoaBench: Evaluating Unified Digital Agents in the Wild
arXiv cs.AI
arXiv:2604.11201v1 Announce Type: cross Abstract: LLM agents now perform strongly in software engineering, deep research, GUI automation, and other applications, and recent agent scaffolds and models increasingly integrate these capabilities into unified systems. Yet most evaluations still test each capability in isolation, leaving a gap for more diverse use cases that require agents to combine capabilities. We introduce CocoaBench, a benchmark for unified digi