Priors in time: Missing inductive biases for language model interpretability

Published: 26 Jan 2026, Last Modified: 19 Feb 2026, ICLR 2026 Poster, License: CC BY 4.0
Keywords: Top-Down Interpretability, Sparse Autoencoders, Temporal Structure, Stationarity
TL;DR: We show that language model representations exhibit rich, non-stationary temporal dynamics, and demonstrate how mismatched priors in standard Sparse Autoencoders lead to characteristic failures—addressed by a new temporally biased SAE.
Abstract: A central aim of interpretability tools applied to language models is to recover meaningful concepts from model activations. Existing feature extraction methods focus on single activations regardless of context, implicitly assuming independence (and therefore stationarity). This leaves open whether they can capture the rich temporal and context-sensitive structure in the activations of language models (LMs). Adopting a Bayesian view, we demonstrate that standard Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time. We then show that LM representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. This mismatch casts doubt on existing SAEs' ability to reflect temporal structures of interest in the data. We introduce a novel SAE architecture---Temporal SAE---with a temporal inductive bias that decomposes the representation at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information that cannot be inferred from the context. Experiments on LLM activations with Temporal SAE demonstrate its ability to correctly parse garden-path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs exhibit significant failures on all of these tasks. Our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
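To make the proposed decomposition concrete, the following is a minimal, hypothetical sketch (not the authors' implementation, which the abstract does not specify) of a temporally biased SAE: a recurrent predictor estimates the activation at each step from the preceding context, and a standard sparse encoder/decoder is applied only to the prediction residual, so dictionary features code the novel information. All module names, dimensions, and the choice of a GRU predictor are illustrative assumptions.

```python
# Hypothetical sketch of a temporally biased SAE, assuming a GRU-based
# context predictor and a ReLU sparse code over the prediction residual.
import torch
import torch.nn as nn


class TemporalSAESketch(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Predictable component: estimate x_t from activations at steps < t.
        self.predictor = nn.GRU(d_model, d_model, batch_first=True)
        # Residual component: sparse encoder/decoder over the prediction error,
        # so dictionary features capture only the novel, fast-moving information.
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, d_model) sequence of LM activations.
        context, _ = self.predictor(x)
        # Shift so the prediction for step t only sees steps before t.
        pred = torch.cat([torch.zeros_like(context[:, :1]), context[:, :-1]], dim=1)
        residual = x - pred
        codes = torch.relu(self.enc(residual))   # sparse codes for novel information
        recon = pred + self.dec(codes)           # predictable part + decoded novelty
        return recon, codes, pred
```

Under these assumptions, training would presumably combine a reconstruction loss on `recon` with a sparsity penalty (e.g. L1) on `codes`, mirroring standard SAE objectives while letting the predictor absorb the slow-moving, context-inferable structure.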
Primary Area: interpretability and explainable AI
Submission Number: 21354