ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
ArXiv cs.AI
arXiv:2509.25843v2 Announce Type: replace Abstract: Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense-based jailbreaking, in which a model that refuses a harmful request complies once the request is rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard)
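The excerpt ends before the method is described, but the name suggests intervening on internal activations. As a generic, hypothetical illustration of what "activation scaling" can mean (not the paper's actual mechanism), one can dampen the component of a hidden activation along an identified direction while leaving the orthogonal part untouched:

```python
# Hypothetical sketch of activation scaling; the function name, the idea of
# a single "unsafe" direction, and the scaling rule are illustrative
# assumptions, not ASGuard's published method.

def scale_along_direction(activation, direction, scale):
    """Scale the projection of `activation` onto `direction` by `scale`,
    leaving the orthogonal component unchanged."""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    proj = sum(a * u for a, u in zip(activation, unit))
    # Replace the projected component: h' = h + (scale - 1) * proj * unit
    return [a + (scale - 1.0) * proj * u for a, u in zip(activation, unit)]

# Example: fully suppress the component along the first axis.
h = [2.0, 3.0]
guarded = scale_along_direction(h, [1.0, 0.0], scale=0.0)
# guarded == [0.0, 3.0]
```

In practice such a transform would be applied inside a model (e.g. via a forward hook on a chosen layer); the pure-Python version above only shows the vector arithmetic.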