Interpreting Video Representations with Spatio-Temporal Sparse Autoencoders

📰 arXiv cs.AI

Spatio-Temporal Sparse Autoencoders make video representations more interpretable while recovering the temporal coherence that standard SAEs lose

Advanced · Published 7 Apr 2026
Action Steps
  1. Apply standard Sparse Autoencoders (SAEs) to video representations to decompose them into interpretable features
  2. Use spatio-temporal contrastive objectives to improve temporal coherence
  3. Implement Matryoshka-style hierarchical feature grouping to further improve feature assignments and the temporal autocorrelation of features
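The steps above can be sketched as a toy spatio-temporal SAE. This is a minimal illustration, not the paper's method: the dimensions, weight initializations, and the names `encode`, `decode`, and `st_loss` are all hypothetical, and the simple consecutive-frame penalty stands in for the paper's spatio-temporal contrastive objective (Matryoshka grouping is not shown).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): T frames of D-dim
# video features, an M-atom dictionary, K active features per frame.
T, D, M, K = 8, 64, 256, 16

W_enc = rng.standard_normal((D, M)) / np.sqrt(D)   # encoder weights
W_dec = rng.standard_normal((M, D)) / np.sqrt(M)   # decoder weights
b_enc = np.zeros(M)

def encode(x):
    """Top-K sparse code per frame: ReLU, then keep the K largest values."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)                      # (T, M)
    kth = np.partition(pre, M - K, axis=1)[:, M - K : M - K + 1]  # K-th largest
    return np.where(pre >= kth, pre, 0.0)

def decode(z):
    """Linear reconstruction from the sparse code."""
    return z @ W_dec

def st_loss(x, z, x_hat, lam=0.1):
    """Reconstruction error plus a temporal-coherence penalty that
    encourages consecutive frames to activate similar features."""
    recon = np.mean((x - x_hat) ** 2)
    temporal = np.mean((z[1:] - z[:-1]) ** 2)
    return recon + lam * temporal

x = rng.standard_normal((T, D))   # stand-in for real video features
z = encode(x)
x_hat = decode(z)
total = st_loss(x, z, x_hat)
```

Minimizing `st_loss` (rather than reconstruction alone) is what pushes the learned features toward temporal coherence; the paper's contrastive objective plays an analogous role.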
Who Needs to Know This

Machine learning researchers and engineers working on video analysis can apply these techniques to improve the performance and interpretability of their models.

Key Insight

💡 Spatio-temporal contrastive objectives and hierarchical grouping can recover, and even exceed, the temporal coherence of the raw video representations

Share This
📹 Improve video representation interpretation with Spatio-Temporal Sparse Autoencoders!
Read full paper →