Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal

📰 ArXiv cs.AI

arXiv:2509.09708v3 (replace-cross)

Abstract: Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance.
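
The core intervention the abstract describes is: encode residual-stream activations into SAE latents, zero out a candidate feature set, and patch the resulting reconstruction difference back into the residual stream. Below is a minimal PyTorch sketch of that ablation step. The `SparseAutoencoder` class, dimensions, and feature indices are illustrative assumptions, not the paper's released code or the authors' exact method.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE over residual-stream activations (dimensions are illustrative)."""

    def __init__(self, d_model: int = 2304, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, resid: torch.Tensor) -> torch.Tensor:
        # ReLU keeps latents sparse and non-negative.
        return torch.relu(self.encoder(resid))

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)


def ablate_features(
    resid: torch.Tensor,
    sae: SparseAutoencoder,
    feature_ids: list[int],
) -> torch.Tensor:
    """Zero a chosen set of SAE latents and patch the change back
    into the residual stream.

    resid:       [batch, seq, d_model] residual-stream activations.
    feature_ids: indices of the candidate "refusal" feature set
                 (hypothetical values here).
    """
    latents = sae.encode(resid)          # [batch, seq, d_latent]
    ablated = latents.clone()
    ablated[..., feature_ids] = 0.0      # knock out the feature set
    # Subtract only the component explained by the ablated features,
    # leaving the SAE's reconstruction error in the stream untouched.
    delta = sae.decode(latents) - sae.decode(ablated)
    return resid - delta


if __name__ == "__main__":
    sae = SparseAutoencoder()
    resid = torch.randn(1, 12, 2304)     # stand-in residual stream
    patched = ablate_features(resid, sae, feature_ids=[101, 2048, 9000])
    print(patched.shape)                 # torch.Size([1, 12, 2304])
```

In a real run the patched activations would be written back at the relevant layer via a forward hook, and the model's output compared before and after to see whether the refusal flips.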

Published 29 Apr 2026