Crosscoding Through Time: Sparse Feature Discovery Across Sequence Positions

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 Virtual | poster | CC BY 4.0
Keywords: Concept Discovery (e.g., SAEs, dictionary learning)
TL;DR: We introduce a new architecture for dictionary learning and show that it is competitive across a panel of synthetic and real-world tasks.
Abstract: Dictionary learning methods, such as Sparse Autoencoders (SAEs) and crosscoders, decompose model activations into human-interpretable building blocks. We introduce _temporal crosscoders_, a simple and flexible framework for feature discovery in Large Language Models (LLMs). To evaluate temporal crosscoders properly, we develop TempBench: a panel of synthetic and real-world tasks for evaluating temporal structures. Temporal crosscoders outperform both conventional and other temporal architectures in both of our synthetic settings and in two of the four real-world settings, more than any other current architecture. Most strikingly, they detect backtracking, a key reasoning behavior, at a 40% higher rate than conventional SAEs, and are 15% more effective at inducing it. Our results establish temporal crosscoders as a simple and flexible framework for discovering both local and temporal features. We provide full code at the following anonymous repository: https://anonymous.4open.science/r/temp-bench-anon/
Submission Number: 254