Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

📰 ArXiv cs.AI

arXiv:2508.01916v3

Abstract: Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable

Published 16 May 2026