Disentangling Superpositions: Interpretable Brain Encoding Model with Sparse Concept Atoms

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Interpretability, Neuroscience, fMRI, Encoding Models, Superposition
Abstract: Encoding models based on word embeddings or artificial neural network (ANN) features reliably predict brain responses to naturalistic stimuli but remain difficult to interpret. A central limitation is superposition—the entanglement of distinct semantic features along correlated directions in dense embeddings, which arises when latent features outnumber the embedding dimensions. This entanglement renders regression weights non-identifiable: different combinations of semantic directions can produce identical predicted brain activity, preventing principled interpretation of voxel selectivity. To overcome this, we introduce the Sparse Concept Encoding Model, which transforms dense embeddings into a higher-dimensional, sparse, and non-negative space of learned concept atoms. This transformation yields an axis-aligned semantic basis where each dimension corresponds to an interpretable concept, enabling direct readout of conceptual selectivity from voxel weights. When applied to fMRI data collected during story listening, our model matches the prediction performance of conventional dense models while substantially enhancing interpretability. It enables novel neuroscientific analyses such as disentangling overlapping cortical representations of time, space, and number, and revealing structured similarity among distributed conceptual maps. This framework offers a scalable and interpretable bridge between ANN-derived features and human conceptual representations in the brain.
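The abstract describes a two-stage pipeline: dense stimulus embeddings are re-expressed as sparse, non-negative activations over an overcomplete dictionary of concept atoms, and a per-voxel linear readout is then fit in that axis-aligned space. The sketch below illustrates this general approach using scikit-learn; it is not the authors' released code, and the array sizes, regularization strengths, and solver choices are illustrative assumptions only.

```python
# Hedged sketch of the pipeline the abstract describes (assumed implementation details):
# dense embeddings -> sparse, non-negative concept-atom codes -> per-voxel linear encoding.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_timepoints, d_embed, n_atoms, n_voxels = 500, 50, 200, 100  # hypothetical sizes

X_dense = rng.standard_normal((n_timepoints, d_embed))  # stimulus embeddings (placeholder)
Y = rng.standard_normal((n_timepoints, n_voxels))        # fMRI responses (placeholder)

# Learn an overcomplete dictionary whose sparse, non-negative codes act as "concept atoms".
dict_learner = DictionaryLearning(
    n_components=n_atoms,            # more atoms than embedding dimensions (overcomplete)
    alpha=1.0,                       # sparsity penalty (assumed value)
    transform_algorithm="lasso_lars",
    positive_code=True,              # non-negative activations
    max_iter=100,
    random_state=0,
)
A_sparse = dict_learner.fit_transform(X_dense)           # (n_timepoints, n_atoms)

# Fit a linear encoding model per voxel in the axis-aligned concept space.
encoder = Ridge(alpha=10.0)
encoder.fit(A_sparse, Y)

# Each column is one voxel's weight profile over concept atoms, so conceptual
# selectivity can be read off directly from the fitted weights.
voxel_weights = encoder.coef_.T                          # (n_atoms, n_voxels)
print(voxel_weights.shape)
```

In this framing, interpretability comes from the basis rather than the regression: because each atom is meant to correspond to a single concept, the non-identifiability caused by superposition in the dense space no longer obscures which concepts drive a voxel's predicted response.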
Primary Area: Neuroscience and cognitive science (e.g., neural coding, brain-computer interfaces)
Submission Number: 19254