On the Limits of Linear Representation Hypotheses in Large Language Models: A Dynamical Systems Analysis

Published: 30 Sept 2025, Last Modified: 30 Sept 2025. Mech Interp Workshop (NeurIPS 2025) Poster. License: CC BY 4.0
Keywords: Foundational work, Steering
Other Keywords: Theoretical
TL;DR: We analyze deep residual networks through dynamical systems theory and show that $\epsilon$-close perturbations can grow to order-one separation within $O(\log(1/\epsilon))$ layers, highlighting the limits of linear interpretability assumptions.
Abstract: Linear representation hypotheses and steering-vector interpretations are increasingly popular in mechanistic interpretability; both suggest that small perturbations in latent space yield predictable changes in model behavior. We provide a rigorous theoretical critique of this perspective by analyzing the chaotic dynamics inherent in deep residual networks through the lens of dynamical systems theory. We prove that two latent vectors that are initially $\epsilon$-close can diverge exponentially under positive Lyapunov exponents, reaching order-one separation within $O(\log(1/\epsilon))$ layers, which fundamentally undermines the assumption that linear operations reliably control model outputs. Our analysis shows that the exponential sensitivity to initial conditions characteristic of chaotic systems makes linear approximations inherently unreliable in deep networks, providing a theoretical foundation for understanding the limitations of current interpretability methods.
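
For intuition on where the $O(\log(1/\epsilon))$ rate comes from, here is a minimal back-of-the-envelope sketch, assuming (purely for illustration, not as the paper's formal setup) a uniform positive layerwise Lyapunov exponent $\lambda$ and a fixed divergence threshold $c$, with $\delta_\ell$ denoting the difference between the two trajectories at layer $\ell$:

$$
\|\delta_\ell\| \approx \epsilon\, e^{\lambda \ell}
\quad\Longrightarrow\quad
\|\delta_\ell\| \geq c \;\text{ once }\;
\ell \geq \frac{1}{\lambda}\ln\frac{c}{\epsilon} \;=\; O\!\left(\log\frac{1}{\epsilon}\right).
$$

Under this toy assumption, halving $\epsilon$ buys only about $\ln 2 / \lambda$ additional layers before a linear (steering-vector) approximation breaks down, which is the sense in which control guarantees degrade only logarithmically in the perturbation size.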
Submission Number: 219