Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
📰 ArXiv cs.AI
arXiv:2602.11786v2 Announce Type: replace-cross

Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often exposes a different class of risk: operational failures that arise under repeated inference on identical or near-identical prompts rather than from broad task-level underperformance. In high-stakes settings, response c…
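The abstract's core idea, that some safety failures only surface when the same prompt is sampled many times, lends itself to a simple illustration. The sketch below is not the paper's protocol; `query_model`, `is_unsafe`, and `stress_test` are hypothetical stand-ins, with a stochastic stub in place of a real LLM API call.

```python
import random

def query_model(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM call; replace with a real API client.
    The stochastic stub only illustrates nondeterministic sampling."""
    rng = random.Random(seed)
    return rng.choice(["SAFE_COMPLETION", "SAFE_COMPLETION", "UNSAFE_COMPLETION"])

def is_unsafe(response: str) -> bool:
    """Hypothetical safety judge; a real setup would use a classifier or rubric."""
    return response == "UNSAFE_COMPLETION"

def stress_test(prompt: str, trials: int = 1000) -> float:
    """Repeatedly sample the *same* prompt and estimate the per-call
    probability of an operational safety failure."""
    failures = sum(is_unsafe(query_model(prompt, seed=i)) for i in range(trials))
    return failures / trials

if __name__ == "__main__":
    p_fail = stress_test("Summarize this patient's discharge notes.", trials=1000)
    print(f"Estimated per-inference failure rate: {p_fail:.3f}")
```

The point of repeating identical prompts, rather than broadening the task mix, is that a per-call failure rate that looks negligible in a single-shot benchmark can compound into a near-certain failure over thousands of production inferences.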