The Centroid Affinity Hypothesis: How Deep Networks Represent Features

ICLR 2026 Conference Submission 22224 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Interpretability, Features
Abstract: Understanding and identifying the features of the input space a deep network (DN) extracts to form its outputs is a focal point of interpretability research, as it enables the reliable deployment of DNs. The current prevailing strategy of operating under the linear representation hypothesis (LRH) -- where features are characterised by directions in a DN's latent space -- is limited in its capacity to identify features relevant to the behaviour of components of the DN (e.g. a neuron or a layer). In this paper, we introduce the centroid affinity hypothesis (CAH) as a strategy through which to identify these features grounded in the behaviour of the DN’s components. We theoretically develop the CAH by exploring how continuous piecewise affine DNs -- such as those using the ReLU activation function -- influence the geometry of regions of the input space. In particular, we show that the centroids of a DN -- which are vector summarisations of the DN's Jacobians -- form affine subspaces to extract features of the input space. Importantly, we can continue to utilise LRH-derived tools, such as sparse autoencoders, to study features through the CAH, along with novel CAH-derived tools. We perform an array of experiments demonstrating how interpretability under the CAH compares to interpretability under the LRH: We can obtain sparser feature dictionaries from the DINO vision transformers that perform better on downstream tasks. We can directly identify neurons in circuits of GPT2-Large. We can train probes on Llama-3.1-8B that better capture the action of generating truthful statements.
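The abstract describes centroids as vector summarisations of a DN's Jacobians for continuous piecewise affine networks. The sketch below is only an illustration of that idea, not the paper's construction: it computes input-output Jacobians of a small ReLU MLP and forms a simple averaged vector summary; the model, the helper `jacobian_at`, and the averaging choice are assumptions made for this example.

import torch
import torch.nn as nn

torch.manual_seed(0)

# A small continuous piecewise-affine (CPA) network: a ReLU MLP.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 4),
)

def jacobian_at(x):
    # Input-output Jacobian of the network at a single point x,
    # with shape (out_dim, in_dim).
    return torch.autograd.functional.jacobian(model, x)

# Illustrative "centroid"-style summary: average the Jacobian rows
# over a batch of inputs to obtain one vector in input space.
# (The paper's exact centroid definition may differ.)
xs = torch.randn(32, 8)
jacs = torch.stack([jacobian_at(x) for x in xs])  # (batch, out_dim, in_dim)
centroid = jacs.mean(dim=(0, 1))                  # a single in_dim-sized vector

print("Jacobian shape per point:", tuple(jacs.shape[1:]))
print("Vector summary shape:", tuple(centroid.shape))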
Primary Area: interpretability and explainable AI
Submission Number: 22224