The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
📰 ArXiv cs.AI
arXiv:2604.06436v3 Announce Type: replace-cross Abstract: We prove that no continuous, utility-preserving wrapper defense (a function $D: X \to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with a connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\ep
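The boundary-fixation claim can be illustrated with a toy one-dimensional example; everything here (the harm score `h`, the threshold `TAU`, and the particular `defense`) is an illustrative assumption, not the paper's actual construction. The idea sketched: if a continuous defense acts as the identity on the strictly safe inputs, continuity forces it to also fix the threshold-level inputs on the boundary of the safe set, so those outputs sit exactly at the threshold rather than strictly below it.

```python
# Toy sketch (assumptions, not the paper's construction): prompt space X = [0, 1],
# a continuous harm score h, and a threshold TAU separating safe from unsafe.
# "Utility-preserving" is modeled as: the defense is the identity on the
# strictly safe set {x : h(x) < TAU}.

def h(x: float) -> float:
    """Illustrative continuous harm score on [0, 1]."""
    return x

TAU = 0.5  # safety threshold (assumed value)

def defense(x: float) -> float:
    """A continuous, utility-preserving defense: identity on the safe set,
    pulling unsafe inputs halfway back toward the threshold."""
    return x if x < TAU else TAU + 0.5 * (x - TAU)

# Approach the threshold from the safe side: the defense never moves these inputs,
# so by continuity it cannot move the boundary point TAU either.
xs = [TAU - 10.0 ** (-k) for k in range(1, 8)]
gaps = [abs(defense(x) - x) for x in xs]
print(gaps)                           # identity on the safe set: every gap is 0.0
print(defense(TAU), h(defense(TAU)))  # boundary fixation: D(TAU) = TAU, harm exactly TAU
```

The point of the sketch is that the threshold-level input is fixed, so its post-defense harm equals the threshold and is never *strictly* safe, matching the abstract's boundary-fixation statement.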