Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

📰 ArXiv cs.AI

arXiv:2509.18127v3 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) enable interpretability research by decomposing entangled model activations into monosemantic features. However, under what circumstances SAEs derive most fine-grained latent features for safety, a low-frequency concept domain, remains unexplored. Two key challenges exist: identifying SAEs with the greatest potential for generating safety domain-specific features, and the prohibitively high cost of detailed feat

Published 15 Apr 2026
Read full paper → ← Back to Reads