Structural Inference: Interpreting Small Language Models with Susceptibilities

Garrett Baker; George Wang; Jesse Hoogland; Vinayak Pathak; Daniel Murfet

Published: 26 Jan 2026, Last Modified: 11 Apr 2026, Venue: ICLR 2026 Poster, License: CC BY 4.0

Keywords: Interpretability, Statistical Physics, Singular Learning Theory

TL;DR: Introduces susceptibilities to study the internal structure of language models

Abstract: We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
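Illustration: a minimal sketch of the covariance form that a first-order (linear response) change typically takes. It assumes a tempered posterior proportional to exp(-n*beta*(L(w) + h*dL(w))) times a prior, for which the derivative of the posterior expectation of an observable phi at h = 0 equals -n*beta*Cov(phi, dL). The sign and normalization used in the paper may differ. The function names, the value of n_beta, and the synthetic samples (which stand in for local SGLD draws) are hypothetical and chosen only for illustration. Because dL is an average of per-token loss differences, the covariance splits into signed per-token terms that sum to the total.

```python
import random


def covariance(xs, ys):
    """Covariance of two equal-length lists (divides by n, not n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def susceptibility(phi, delta_loss_per_token, n_beta):
    """Covariance estimate of a linear response, with per-token terms.

    phi: S observable values, one per posterior sample.
    delta_loss_per_token: S rows, each holding T per-token loss differences
        (perturbed-distribution loss minus base loss) at that sample.
    n_beta: effective inverse temperature (assumed convention).
    Returns (total, per_token), where sum(per_token) equals total.
    """
    num_tokens = len(delta_loss_per_token[0])
    per_token = []
    for t in range(num_tokens):
        column = [row[t] for row in delta_loss_per_token]
        per_token.append(-n_beta * covariance(phi, column) / num_tokens)
    mean_delta = [sum(row) / num_tokens for row in delta_loss_per_token]
    total = -n_beta * covariance(phi, mean_delta)
    return total, per_token


# Synthetic stand-in for posterior samples: token 0 co-varies positively
# with the observable, token 1 negatively, tokens 2 and 3 are pure noise.
rng = random.Random(0)
num_samples = 2000
phi = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
delta = [
    [
        0.5 * p + rng.gauss(0.0, 0.1),
        -0.3 * p + rng.gauss(0.0, 0.1),
        rng.gauss(0.0, 0.1),
        rng.gauss(0.0, 0.1),
    ]
    for p in phi
]

total, per_token = susceptibility(phi, delta, n_beta=10.0)
print("total susceptibility:", total)
print("per-token contributions:", per_token)
```

Under this convention, a token whose loss difference rises together with the observable receives a negative contribution and one that moves against it receives a positive contribution, which is the sense in which the per-token terms can be read as signed attribution scores.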

Primary Area: interpretability and explainable AI

Submission Number: 20053
