High Frequency Latents Are Features, Not Bugs

Published: 05 Mar 2025 · Last Modified: 18 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: Mechanistic Interpretability, Sparse Autoencoders, Features, Language Models, Latents
TL;DR: We find that high frequency latents in sparse autoencoders likely represent genuine dense language model features.
Abstract: Sparse autoencoders (SAEs) have shown success at decomposing language model activations into a sparse set of interpretable linear representations ("latents"). However, recent work identifies a challenge for SAEs: high frequency latents (HFLs), which are seemingly uninterpretable and occur on more than 10% of tokens. In this work, we find that HFLs have many unique properties: 1) most HFLs have a "pair", another HFL pointing in the geometrically opposite direction with which they never co-occur; 2) the HFL subspace is robust to the SAE initialization seed, but HFLs themselves are not; 3) when an SAE is trained on activations with the HFL subspace ablated, no new HFLs are learned; and 4) HFLs have uniquely high similarity with the SAE bias vector. Our experiments lead us to hypothesize that the HFL subspace is not an artifact of SAE training, but instead represents a subspace of truly dense language model features. We present preliminary results interpreting this dense subspace, including HFLs that represent context position, HFLs that fire continuously on large blocks of text, HFLs that fire on topic sentences, and HFLs that fire on numeric data.
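The abstract's first property (antipodal HFL "pairs" that never co-occur) and the >10% firing-frequency criterion translate directly into a simple check. Below is a minimal sketch, not from the paper, assuming access to a trained SAE's latent activations and decoder directions as plain tensors; the function name and the thresholds (`freq_threshold`, `cos_threshold`) are illustrative assumptions.

```python
# Sketch (not the paper's code): find high-frequency latents (HFLs) and match them
# into geometrically opposite, never-co-occurring pairs, as described in the abstract.
import torch


def find_hfls_and_pairs(latent_acts: torch.Tensor,   # (n_tokens, n_latents) SAE activations
                        decoder_dirs: torch.Tensor,  # (n_latents, d_model) decoder rows
                        freq_threshold: float = 0.10,  # "occur on more than 10% of tokens"
                        cos_threshold: float = -0.9):  # assumed cutoff for "opposite direction"
    active = latent_acts > 0                                  # boolean firing mask
    freq = active.float().mean(dim=0)                         # firing frequency per latent
    hfl_idx = torch.nonzero(freq > freq_threshold).squeeze(-1)

    # Cosine similarity between decoder directions of the HFLs.
    dirs = torch.nn.functional.normalize(decoder_dirs[hfl_idx], dim=-1)
    cos = dirs @ dirs.T

    pairs = []
    for a in range(len(hfl_idx)):
        for b in range(a + 1, len(hfl_idx)):
            i, j = int(hfl_idx[a]), int(hfl_idx[b])
            never_cooccur = not (active[:, i] & active[:, j]).any()
            if cos[a, b] < cos_threshold and never_cooccur:
                pairs.append((i, j))
    return hfl_idx.tolist(), pairs
```

Such a check only recovers candidate pairs under the stated thresholds; the paper's actual criteria and experimental setup are described in the full text.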
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 63