From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

ICLR 2026 Conference Submission 14999 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: disentanglement, interpretability, feature, SAE, causal representation learning
TL;DR: By testing multiple concepts simultaneously instead of in isolation, we can measure how often popular interpretability methods like SAEs learn truly independent concept representations.
Abstract: A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, under implicit independence assumptions that may not hold in practice. It is therefore unclear whether common featurization methods, including sparse autoencoders (SAEs) and sparse probes, recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting in which we control the correlations between textual concepts such as sentiment, domain, and tense, and analyze performance as these correlations strengthen. We first evaluate the extent to which featurizers learn disentangled representations of each concept under increasing correlation strength. We then investigate whether concepts are sufficiently captured by single features or require multiple dimensions; using $k$-sparse probes, we find that $k$ often needs to be much greater than 1 for optimal scores. Finally, we perform a causal investigation in which we steer multiple features simultaneously and observe whether each concept can be manipulated independently. Even under ideal uniform distributions of concepts, we find that unsupervised methods like SAEs struggle to learn disentangled concept representations. We further find that the feature representations we identify correspond to disjoint subspaces in activation space, yet steering with the top feature for one concept still often affects other concepts; this suggests a fundamental entanglement of concepts in the model's representation space. These findings underscore the importance of compositional and out-of-distribution evaluations in interpretability research.
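As a concrete illustration of the $k$-sparse probing setup sketched in the abstract, the snippet below fits a probe that may use only the $k$ most informative features for a binary concept, on synthetic data with a tunable correlation between two concepts. This is a minimal sketch under stated assumptions, not the authors' pipeline: the mean-difference feature ranking, the function name `fit_k_sparse_probe`, the correlation knob `rho`, and the planted-feature toy data are all illustrative choices.

```python
# Minimal sketch (not the submission's exact method): a k-sparse probe that
# predicts a binary concept from only the k highest-ranked feature dimensions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_k_sparse_probe(features, labels, k=8, seed=0):
    """features: (n_samples, n_features) activations; labels: (n_samples,) in {0, 1}."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    # Rank features by absolute difference in class-conditional means (an
    # assumed selection heuristic, used here for simplicity).
    scores = np.abs(X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0))
    top_k = np.argsort(scores)[-k:]
    probe = LogisticRegression(max_iter=1000).fit(X_tr[:, top_k], y_tr)
    return top_k, probe.score(X_te[:, top_k], y_te)

# Toy data: two binary concepts whose correlation is controlled by rho,
# each planted in one feature dimension of otherwise random activations.
rng = np.random.default_rng(0)
n, d = 2000, 256
sentiment = rng.integers(0, 2, n)
rho = 0.8  # with probability rho, "tense" copies "sentiment"
tense = np.where(rng.random(n) < rho, sentiment, rng.integers(0, 2, n))
features = rng.normal(size=(n, d))
features[:, 0] += 2.0 * sentiment  # planted "sentiment" direction
features[:, 1] += 2.0 * tense      # planted "tense" direction

# Sweep k to see how many dimensions the probe needs for a given concept.
for k in (1, 4, 16):
    _, acc = fit_k_sparse_probe(features, sentiment, k=k)
    print(f"k={k}: sentiment probe accuracy = {acc:.3f}")
```

In the paper's evaluation the features would instead be SAE (or other featurizer) activations on text whose concept labels are sampled at controlled correlation strengths; the sweep over $k$ mirrors the finding that a single feature is often insufficient.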
Primary Area: interpretability and explainable AI
Submission Number: 14999