Keywords: Concept Discovery (e.g., SAEs, dictionary learning)
TL;DR: We introduce a new architecture for dictionary learning and show that it is competitive across a panel of synthetic and real-world tasks.
Abstract: Dictionary learning methods, such as Sparse Autoencoders (SAEs) and crosscoders, decompose model activations into human-interpretable building blocks. We introduce _temporal crosscoders_, a simple and flexible framework for feature discovery in Large Language Models (LLMs). To evaluate temporal crosscoders properly, we develop TempBench, a panel of synthetic and real-world tasks for evaluating temporal structures. Temporal crosscoders outperform both conventional and other temporal architectures in both of our synthetic settings and in two of the four real-world settings, more than any other current architecture. Most strikingly, they detect backtracking, a key reasoning behavior, at a 40% higher rate than conventional SAEs and are 15% more effective at inducing it. Our results establish temporal crosscoders as a simple and flexible framework for discovering both local and temporal features. We provide full code in the following anonymous repository: https://anonymous.4open.science/r/temp-bench-anon/.
Submission Number: 254