Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

📰 ArXiv cs.AI

arXiv:2603.11331v2

Abstract: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that strong adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the

Published 20 Apr 2026