Keywords: LLM interpretability, linear representation hypothesis
TL;DR: We provide a comprehensive theoretical analysis demonstrating that the linear representation hypothesis cannot be universally true.
Abstract: The Linear Representation Hypothesis (LRH) posits that semantic concepts in deep neural networks are encoded as linear directions in activation space, recoverable via linear probes. We provide a comprehensive theoretical analysis demonstrating that the LRH cannot be universally true. Our contributions include: (1) sharp combinatorial bounds using VC-dimension theory showing exponential gaps between concept complexity and linear decodability, (2) circuit complexity arguments proving that depth-separated functions cannot have linear intermediate representations without exponential dimension, (3) explicit algebraic constructions of non-linear concept families with detailed complexity analysis, (4) measure-theoretic results on the generic failure of linear representations, and (5) information-theoretic lower bounds on representation dimension. We complement the theory with reproducible experiments and discuss precise conditions under which the LRH holds approximately.
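The kind of failure the abstract describes can be illustrated with the classic minimal case: the XOR (parity) concept on two binary features, which no linear probe can decode perfectly. The sketch below is not from the submission; it is a hypothetical, self-contained demonstration that brute-forces linear probes sign(w·x + b) over a parameter grid and shows the best achievable accuracy on the four XOR points is 0.75.

```python
import itertools
import numpy as np

# The four inputs of the XOR ("parity") concept and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Brute-force search over linear probes sign(w . x + b) on a coarse grid.
# On 4 points, accuracy depends only on the sign pattern the probe induces,
# and this grid realizes every sign pattern a linear probe can achieve here.
grid = np.linspace(-2.0, 2.0, 9)
best_acc = 0.0
for w1, w2, b in itertools.product(grid, repeat=3):
    preds = (X @ np.array([w1, w2]) + b > 0).astype(int)
    best_acc = max(best_acc, float((preds == y).mean()))

print(best_acc)  # -> 0.75: no linear probe classifies all four XOR points
```

This is the smallest instance of the gap between concept complexity and linear decodability that contributions (1) and (3) generalize; a single hidden layer (or any non-linear feature map) recovers XOR exactly.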
Submission Number: 227