Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

📰 ArXiv cs.AI

arXiv:2602.11786v2 Announce Type: replace-cross

Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often exposes a different class of risk: operational failures that arise under repeated inference on identical or near-identical prompts rather than from broad task-level underperformance. In high-stakes settings, response c…
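The failure mode the abstract describes, inconsistent behavior across repeated inference on the same prompt, can be probed with a simple consistency check. Below is a minimal sketch, not the paper's method: `query_model` is a hypothetical stub standing in for a real LLM call, and the agreement metric is one plausible way to quantify response stability.

```python
import random
from collections import Counter

def query_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a real LLM API call. It occasionally
    # flips its answer to mimic sampling-induced inconsistency.
    rng = random.Random(seed)
    return "SAFE" if rng.random() < 0.9 else "UNSAFE"

def repeat_stress_test(prompt: str, n: int = 100) -> float:
    """Send the same prompt n times and return the agreement rate
    of the most common answer: a simple consistency metric."""
    answers = [query_model(prompt, seed=i) for i in range(n)]
    _, count = Counter(answers).most_common(1)[0]
    return count / n

rate = repeat_stress_test("Is this request allowed?", n=200)
print(f"modal-answer agreement: {rate:.2f}")
```

A real harness would replace the stub with calls to the model under test and could track per-prompt agreement over many near-identical paraphrases, which is the operational risk the paper targets.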

Published 29 Apr 2026