Keywords: Sparse Autoencoders, Interpretability, Dictionary Learning
Abstract: Interpretability strives to discover the concepts learned and represented by models, frequently with unsupervised learning methods such as dictionary learning. While such methods offer flexibility and apply to a wide variety of domains, recent implementations such as sparse autoencoders (SAEs) discard valuable modality-specific information and priors about the structure of different data modalities. In this work, we argue that the temporal dimension of language is a rich feature source that dictionary learning methods can leverage in a self-supervised manner, allowing for better learning and disentanglement of the semantic and syntactic features represented by language models. We propose a data-generating process for such features, which informs a novel approach to training Temporal SAEs that can extract semantic concepts from natural language. We experimentally verify that accounting for the temporal structure of language improves SAEs' ability to capture semantic features in text data with minimal loss in performance.
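To make the idea concrete, below is a minimal sketch of a standard SAE trained on language-model activations, extended with a hypothetical temporal-consistency penalty on the codes of adjacent tokens. The paper's actual Temporal SAE objective is not given in the abstract, so the `temporal_coef` term and its form here are purely illustrative assumptions, not the authors' method.

```python
# Minimal sketch: a sparse autoencoder over per-token LM activations,
# with an ILLUSTRATIVE temporal term (not the paper's actual objective).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) activations from a language model
        z = torch.relu(self.encoder(x))  # sparse dictionary codes per token
        x_hat = self.decoder(z)          # reconstruction of the activations
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coef=1e-3, temporal_coef=1e-3):
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().mean()
    # Hypothetical temporal term: encourage codes of adjacent tokens to stay
    # similar, reflecting the intuition that semantic features persist across
    # the sequence while syntactic features change quickly.
    temporal = (z[:, 1:] - z[:, :-1]).pow(2).mean()
    return recon + l1_coef * sparsity + temporal_coef * temporal
```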
Submission Number: 29