Plausible but Wrong: A Case Study on Agentic Failures in Astrophysical Workflows
📰 ArXiv cs.AI
arXiv:2604.25345v1. Announce type: new. Abstract: Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation