Keywords: Sparse Autoencoders, Interpretability, Dictionary Learning
Abstract: Interpretability strives to discover the concepts learned and represented by models, frequently with unsupervised learning methods such as dictionary learning. While such methods offer flexibility and apply to a wide variety of domains, recent implementations such as sparse autoencoders (SAEs) discard valuable modality-specific information and priors about the structure of different data modalities. In this work, we argue that the temporal dimension of language is a rich feature source that dictionary learning methods can leverage in a self-supervised manner, allowing for better learning and disentanglement of the semantic and syntactic features represented by language models. We propose a data-generating process for such features, which informs a novel approach to training Temporal SAEs that can extract semantic concepts from natural language. We experimentally verify that accounting for the temporal structure of language improves SAEs' ability to capture semantic features in text data with minimal loss in performance.
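To make the idea concrete, below is a minimal sketch of a standard SAE trained on language-model activations, extended with a hypothetical temporal-consistency penalty on the codes of adjacent tokens. The paper's actual Temporal SAE objective is not given in the abstract, so the `temporal_coef` term and its form here are purely illustrative assumptions, not the authors' method.

```python
# Minimal sketch: a sparse autoencoder over per-token LM activations,
# with an ILLUSTRATIVE temporal term (not the paper's actual objective).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) activations from a language model
        z = torch.relu(self.encoder(x))  # sparse dictionary codes per token
        x_hat = self.decoder(z)          # reconstruction of the activations
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coef=1e-3, temporal_coef=1e-3):
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().mean()
    # Hypothetical temporal term: encourage codes of adjacent tokens to stay
    # similar, reflecting the intuition that semantic features persist across
    # the sequence while syntactic features change quickly.
    temporal = (z[:, 1:] - z[:, :-1]).pow(2).mean()
    return recon + l1_coef * sparsity + temporal_coef * temporal
```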
Submission Number: 29