The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail

📰 ArXiv cs.AI

arXiv:2604.06436v3 Announce Type: replace-cross

Abstract: We prove that no continuous, utility-preserving wrapper defense (a function $D: X\to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\epsilon$-
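To make the topological flavor of this claim concrete, here is a minimal one-dimensional sketch of the intermediate-value obstruction. The prompt space $X=[0,1]$, the safety score $s(x)=x$, the threshold $\tau=\tfrac12$, and the two utility-preservation constraints placed on $D$ are illustrative assumptions for this sketch, not the paper's exact hypotheses.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Toy one-dimensional instance (illustrative assumptions, not the paper's
% exact hypotheses): prompt space $X=[0,1]$ (connected), safety score
% $s(x)=x$, threshold $\tau=\tfrac12$; "strictly safe" means
% $s(D(x))>\tau$ for every input $x$.
Assume the wrapper $D\colon[0,1]\to[0,1]$ is continuous and that utility
preservation pins down its behavior at the two extremes:
\[
  D(1)=1 \quad\text{(a clearly benign prompt passes through unchanged)},
  \qquad
  s(D(0))<\tau \quad\text{(a maximally adversarial prompt cannot be made
  safe without destroying its utility)}.
\]
Then $s\circ D$ is continuous on the connected set $[0,1]$ and takes values
on both sides of $\tau$, since $s(D(0))<\tfrac12<1=s(D(1))$. Its image is
therefore an interval containing $\tau$, so some $x_{0}\in[0,1]$ satisfies
\[
  s\bigl(D(x_{0})\bigr)=\tau ,
\]
i.e.\ at least one defended output sits exactly at the safety threshold and
is not strictly safe.
\end{document}
```

This sketch only illustrates why continuity on a connected prompt space leaves no room for a strict safety margin everywhere; the paper's boundary-fixation and characterization results are stated under its own, successively stronger hypotheses.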

Published 14 Apr 2026
Read full paper →