Keywords: Mechanistic Interpretability, Concept Extraction, Explainability
TL;DR: We propose a novel method for extracting interpretable concepts from a model's activations by clustering the differences between samples' activations.
Abstract: We propose an alternative to sparse autoencoders (SAEs) as a simple and effective unsupervised method for extracting interpretable concepts from neural networks. The core idea is to cluster differences in activations, which we formally justify within a discriminant analysis framework. To enhance the diversity of extracted concepts, we refine the approach by weighting the clustering using the skewness of activations. The method aligns with Deleuze's modern view of concepts as differences. We evaluate the approach across five models and three modalities (vision, language, and audio), measuring concept quality, diversity, and consistency. Our results show that the proposed method achieves concept quality surpassing prior unsupervised SAE variants while approaching supervised baselines, and that the extracted concepts enable steering of a model’s inner representations, demonstrating their causal influence on downstream behavior.
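Illustrative sketch: the abstract's core idea (cluster differences in activations, with a skewness-based weighting) could be prototyped roughly as below. The pairwise-differencing scheme, k-means clustering, and the exact form of the skewness weighting are all assumptions for illustration, not the paper's actual procedure.

```python
# Sketch of "cluster differences in activations" with skewness weighting.
# All specifics (random pairing, k-means, |skew| as per-dimension weight)
# are illustrative assumptions, not the submission's exact method.
import numpy as np
from scipy.stats import skew
from sklearn.cluster import KMeans

def extract_concepts(activations: np.ndarray, n_concepts: int = 10, seed: int = 0):
    """activations: (n_samples, d) hidden activations from one layer."""
    rng = np.random.default_rng(seed)

    # Form differences between randomly paired samples (assumed pairing scheme).
    n = len(activations)
    idx_a = rng.integers(0, n, size=4 * n)
    idx_b = rng.integers(0, n, size=4 * n)
    diffs = activations[idx_a] - activations[idx_b]

    # Weight each dimension by the skewness of its activation distribution
    # (assumed form of the skewness weighting mentioned in the abstract).
    weights = np.abs(skew(activations, axis=0))
    diffs_weighted = diffs * weights

    # Cluster the weighted differences; centroids serve as concept directions.
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=seed).fit(diffs_weighted)
    concepts = km.cluster_centers_

    # Normalize so each concept is a unit direction in activation space.
    return concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
```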
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 13997