Keywords: LLM interpretability, linear representation hypothesis
TL;DR: We provide a comprehensive theoretical analysis demonstrating that the linear representation hypothesis cannot be universally true.
Abstract: The Linear Representation Hypothesis (LRH) posits that semantic concepts in deep neural networks are encoded as linear directions in activation space, recoverable via linear probes. We provide a comprehensive theoretical analysis demonstrating that the LRH cannot be universally true. Our contributions include: (1) sharp combinatorial bounds using VC-dimension theory showing exponential gaps between concept complexity and linear decodability, (2) circuit complexity arguments proving that depth-separated functions cannot have linear intermediate representations without exponential dimension, (3) explicit algebraic constructions of non-linear concept families with detailed complexity analysis, (4) measure-theoretic results on the generic failure of linear representations, and (5) information-theoretic lower bounds on representation dimension. We complement the theory with reproducible experiments and discuss precise conditions under which the LRH holds approximately.
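The kind of failure the abstract describes can be illustrated with the classic minimal case: the XOR (parity) concept on two binary features, which no linear probe can decode perfectly. The sketch below is not from the submission; it is a hypothetical, self-contained demonstration that brute-forces linear probes sign(w·x + b) over a parameter grid and shows the best achievable accuracy on the four XOR points is 0.75.

```python
import itertools
import numpy as np

# The four inputs of the XOR ("parity") concept and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Brute-force search over linear probes sign(w . x + b) on a coarse grid.
# On 4 points, accuracy depends only on the sign pattern the probe induces,
# and this grid realizes every sign pattern a linear probe can achieve here.
grid = np.linspace(-2.0, 2.0, 9)
best_acc = 0.0
for w1, w2, b in itertools.product(grid, repeat=3):
    preds = (X @ np.array([w1, w2]) + b > 0).astype(int)
    best_acc = max(best_acc, float((preds == y).mean()))

print(best_acc)  # -> 0.75: no linear probe classifies all four XOR points
```

This is the smallest instance of the gap between concept complexity and linear decodability that contributions (1) and (3) generalize; a single hidden layer (or any non-linear feature map) recovers XOR exactly.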
Submission Number: 227