Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal
📰 ArXiv cs.AI
arXiv:2509.09708v3 Announce Type: replace-cross
Abstract: Refusal of harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal
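The core intervention described in the abstract, ablating a set of SAE latent features and writing the result back into the residual stream, can be sketched as follows. This is a minimal toy illustration, not the authors' code: the dimensions and weights are random stand-ins for a trained SAE, and the common practice of preserving the SAE's reconstruction error so that ablating nothing leaves the activation unchanged is an assumption about the setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): residual-stream width and SAE latent width.
d_model, d_sae = 16, 64

# Random weights stand in for a trained sparse autoencoder.
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_ablate(resid, ablate_idx):
    """Encode a residual-stream vector with the SAE, zero the chosen
    latent features, and decode back, carrying over the SAE's
    reconstruction error so untouched activations pass through exactly."""
    latents = np.maximum(resid @ W_enc + b_enc, 0.0)  # ReLU encoder
    recon = latents @ W_dec + b_dec
    error = resid - recon                # the part the SAE fails to capture
    latents[ablate_idx] = 0.0            # ablate the candidate feature set
    return latents @ W_dec + b_dec + error

resid = rng.normal(size=d_model)
patched = sae_ablate(resid, ablate_idx=[3, 7])

# Ablating the empty set reproduces the original activation exactly.
assert np.allclose(sae_ablate(resid, ablate_idx=[]), resid)
```

In the paper's setting, the patched activation would replace the model's residual stream at that layer during generation, and the search is over which `ablate_idx` sets flip the model's behaviour on harmful prompts.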