Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research

Published: 30 Sept 2025 (Last Modified: 30 Sept 2025)
Venue: Mech Interp Workshop (NeurIPS 2025), Poster
License: CC BY 4.0
Open Source Links: https://github.com/seantrott/mechinterp_generalizability
Keywords: Circuit analysis, Foundational work
Other Keywords: epistemology, generalizability, philosophy of science
TL;DR: Interpretability research lacks clear principles for deciding when findings generalize across model instances; I propose several "axes of correspondence" along which claims might generalize, and validate the framework in an empirical case study.
Abstract: Research on Large Language Models (LLMs) increasingly focuses on identifying mechanistic explanations for their behaviors, yet the field lacks clear principles for determining when (and how) findings from one model instance generalize to another. This paper addresses a fundamental epistemological challenge: given a mechanistic claim about a particular model, what justifies extrapolating this finding to other LLMs—and along which dimensions might such generalizations hold? I propose five potential *axes of correspondence* along which mechanistic claims might generalize: functional (whether they satisfy the same functional criteria), developmental (whether they develop at similar points during pretraining), positional (whether they occupy similar absolute or relative positions), relational (whether they interact with other model components in similar ways), and configurational (whether they correspond to particular regions or structures in weight-space). To empirically validate this framework, I analyze "1-back attention heads" (components attending to previous tokens) across pretraining in random seeds of the Pythia models (14M, 70M, 160M, 410M). The results reveal striking consistency in the *developmental trajectories* of 1-back attention across models, while positional consistency is more limited. Moreover, seeds of larger models systematically show earlier onsets, steeper slopes, and higher peaks of 1-back attention. I also address possible objections to the arguments and proposals outlined here. Finally, I conclude by arguing that progress on the generalizability of mechanistic interpretability research will consist in mapping constitutive design properties of LLMs to their emergent behaviors and mechanisms.
Submission Number: 136
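
To make the case study described in the abstract concrete, below is a minimal sketch of how "1-back attention" might be measured for a single Pythia pretraining checkpoint: for each head, average the attention weight each token places on the immediately preceding token. This is not the paper's code (see the linked repository for that); the model name, checkpoint revision string, and example text are assumptions for illustration.

```python
# Minimal sketch: measuring per-head "1-back attention" (attention to the
# previous token) in a Pythia pretraining checkpoint.
# Assumptions: model repo and revision tag are illustrative; Pythia
# checkpoints on the HuggingFace Hub use revision names like "step1000".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # assumed; seed variants live in other repos
revision = "step1000"                 # assumed pretraining checkpoint tag

tok = AutoTokenizer.from_pretrained(model_name, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision=revision,
    output_attentions=True,
    attn_implementation="eager",  # ensure attention weights are returned
)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions: tuple over layers, each of shape (batch, heads, seq, seq)
for layer_idx, attn in enumerate(out.attentions):
    # Sub-diagonal entries are attention from each query position to the
    # token one position back: attn[head, i+1, i].
    one_back = torch.diagonal(attn[0], offset=-1, dim1=-2, dim2=-1)  # (heads, seq-1)
    scores = one_back.mean(dim=-1)  # mean 1-back attention per head
    print(f"layer {layer_idx}: max 1-back head score = {scores.max():.3f}")
```

Repeating this measurement over many checkpoints (revisions) and random seeds, and averaging over a larger text sample, would yield the developmental trajectories of 1-back attention that the abstract describes.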