MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

📰 ArXiv cs.AI

MetaSAEs jointly train sparse autoencoders with a decomposability penalty, producing more atomic latents

advanced Published 7 Apr 2026

Action Steps

Identify where safety-relevant applications need atomic sparse autoencoder latents
Apply joint training with a decomposability penalty to make latents more atomic
Evaluate whether MetaSAEs produce more coherent and interpretable latents
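
The steps above can be sketched as a single joint objective. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not specify the architecture or the form of the penalty, so the sketch assumes a ReLU sparse autoencoder, a second (meta) autoencoder that tries to rebuild each primary decoder direction, and a penalty equal to the mean squared cosine similarity between each direction and its meta reconstruction. The names `init_sae`, `sae_forward`, `joint_loss`, `l1_coeff`, and `penalty_coeff` are all hypothetical.

```python
import numpy as np


def init_sae(rng, d_in, n_latents):
    # Hypothetical helper: small random weights, zero biases.
    return {
        "w_enc": rng.normal(size=(d_in, n_latents)) * 0.1,
        "b_enc": np.zeros(n_latents),
        "w_dec": rng.normal(size=(n_latents, d_in)) * 0.1,
        "b_dec": np.zeros(d_in),
    }


def unit_rows(w):
    # Normalize each row to (approximately) unit length; zero rows stay zero.
    return w / (np.linalg.norm(w, axis=1, keepdims=True) + 1e-8)


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    # Assumed architecture: linear encoder, ReLU, linear decoder.
    z = np.maximum(x @ w_enc + b_enc, 0.0)
    return z, z @ w_dec + b_dec


def joint_loss(x, primary, meta, l1_coeff=1e-3, penalty_coeff=0.1):
    # Standard SAE terms on the activations x.
    z, x_hat = sae_forward(x, **primary)
    reconstruction = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
    sparsity = float(np.mean(np.sum(np.abs(z), axis=1)))

    # The meta autoencoder takes each primary decoder direction as an input.
    directions = unit_rows(primary["w_dec"])
    _, directions_hat = sae_forward(directions, **meta)

    # Assumed penalty: a direction that the meta autoencoder rebuilds well
    # (cosine near 1) is treated as decomposable and is penalized.
    cosine = np.sum(directions * unit_rows(directions_hat), axis=1)
    decomposability = float(np.mean(cosine ** 2))

    total = reconstruction + l1_coeff * sparsity + penalty_coeff * decomposability
    return {
        "reconstruction": reconstruction,
        "sparsity": sparsity,
        "decomposability": decomposability,
        "total": total,
    }


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))     # stand-in for model activations
primary = init_sae(rng, 16, 64)   # primary SAE: 64 latents over 16 dimensions
meta = init_sae(rng, 16, 8)       # meta SAE: 8 meta-latents over decoder directions
terms = joint_loss(x, primary, meta)
print(terms)
```

In an actual training loop the meta autoencoder would be fit to reconstruct the decoder directions while the primary autoencoder minimizes `total`, using an autodiff framework. The sketch only evaluates the loss terms, which is enough to track the decomposability score alongside reconstruction error when comparing runs.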

Who Needs to Know This

ML researchers and engineers working on safety-relevant applications such as alignment detection and model steering, where MetaSAEs can yield more interpretable and coherent latent representations

Key Insight

💡 MetaSAEs produce more atomic sparse autoencoder latents by penalizing latents that blend representational subspaces