StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
📰 ArXiv cs.AI
arXiv:2510.10074v2
Announce Type: replace
Abstract: Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution