Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
📰 ArXiv cs.AI
arXiv:2604.19533v1 Announce Type: cross Abstract: We introduce the Cyber Defense Benchmark, which measures how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs, with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus (spanning 86 MITRE ATT&CK sub-techniques across 12 tactics) into a Gy…
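Since the task is to recover the exact timestamps of malicious events, a natural way to score an agent is precision/recall over the predicted timestamp set. The abstract is truncated before any metric is described, so the sketch below is an assumption, not the paper's method: `score_hunt`, its field names, and the `tolerance` parameter are all illustrative.

```python
from datetime import datetime, timedelta

def score_hunt(predicted: set[datetime], truth: set[datetime],
               tolerance: timedelta = timedelta(seconds=0)) -> dict:
    """Hypothetical scorer: precision/recall over event timestamps.

    Exact-match (zero tolerance) is assumed; the benchmark's actual
    metric is not given in the truncated abstract.
    """
    # A prediction counts as a hit if it lands within `tolerance`
    # of some ground-truth malicious event timestamp.
    hits = {p for p in predicted
            if any(abs(p - t) <= tolerance for t in truth)}
    recalled = {t for t in truth
                if any(abs(p - t) <= tolerance for p in predicted)}
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(recalled) / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the agent flags two events, one matching a true malicious event.
truth = {datetime(2020, 9, 21, 3, 5, 12), datetime(2020, 9, 21, 3, 7, 48)}
pred = {datetime(2020, 9, 21, 3, 5, 12), datetime(2020, 9, 21, 3, 9, 1)}
print(score_hunt(pred, truth))  # precision=0.5, recall=0.5, f1=0.5
```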