Mechanistic Interpretability of Animacy Effects on Structure Choice in GPT-2

Published: 18 May 2026, Last Modified: 18 May 2026, CoNLL 2026 Archival, CC BY 4.0
Keywords: mechanistic interpretability, animacy, structure choice, activation patching, psycholinguistics
TL;DR: We use mechanistic interpretability methods to show that animacy representations in GPT-2 causally drive syntactic structure choice, going beyond mere behavioral alignment with human performance.
Abstract: Language models (LMs) exhibit human-like behavior across linguistic tasks, yet behavioral similarity does not establish mechanistic correspondence. Animacy (whether an entity is alive and sentient) is a well-documented semantic feature that shapes linguistic behavior in humans. Although LMs are behaviorally sensitive to animacy, the mechanistic basis of this sensitivity remains unexplored. In this study, we probe GPT-2 Small's internal circuitry to test whether animacy representations causally drive syntactic structure choice. Activation patching confirms a causal role: swapping animacy representations inside the model shifts its downstream output. Critically, bidirectional patching reveals that animacy conditions differ in how strongly they commit to a structure: some animacy configurations resist perturbation and exert strong causal influence, while others remain flexible. We identify 22 attention heads mediating these effects, split between passive-promoting and passive-suppressing populations, suggesting that GPT-2 Small's structure choice emerges from internal competition between these opposing heads. These findings provide mechanistic grounding for animacy effects documented extensively in psycholinguistic research and demonstrate how interpretability methods can enrich and test psycholinguistic theory.
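The bidirectional activation-patching logic summarized in the abstract can be illustrated with a small toy sketch. This is not the authors' code and does not use GPT-2: it assumes a hypothetical two-unit network with hand-set weights, treats each hidden unit as a stand-in for one attention head, encodes each input as [agent animacy, patient animacy], and adopts the assumed convention that a positive logit difference means a preference for the passive. The names `forward`, `patch_effect`, `animate_agent`, and `animate_patient` are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy weights (NOT GPT-2). Input = [agent animacy, patient animacy].
# Each hidden unit plays the role of one "head" in this illustration.
W1 = [[1.0, -0.5],   # unit 0: responds to an animate agent
      [-0.5, 1.0]]   # unit 1: responds to an animate patient
# Readout: positive logit difference = prefer passive (assumed sign convention).
W2 = [-1.0, 1.5]     # unit 0 suppresses the passive, unit 1 promotes it


def relu(v: float) -> float:
    return max(0.0, v)


def forward(x: List[float],
            patch: Optional[Dict[int, float]] = None) -> Tuple[List[float], float]:
    """Return hidden activations and the passive-minus-active logit difference.

    `patch` maps a hidden-unit index to a value that overwrites that unit's
    activation before the readout, which is the core move in activation patching.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    for idx, value in (patch or {}).items():
        hidden[idx] = value
    logit_diff = sum(w * h for w, h in zip(W2, hidden))
    return hidden, logit_diff


def patch_effect(source: List[float], target: List[float], unit: int) -> float:
    """Copy `unit`'s activation from the source run into the target run and
    return the resulting change in the target's logit difference."""
    source_hidden, _ = forward(source)
    _, baseline = forward(target)
    _, patched = forward(target, patch={unit: source_hidden[unit]})
    return patched - baseline


animate_agent = [1.0, 0.0]    # animate agent, inanimate patient (favors active)
animate_patient = [0.0, 1.0]  # inanimate agent, animate patient (favors passive)

# Bidirectional patching: every unit is patched in both directions.
for unit in range(len(W1)):
    into_agent = patch_effect(animate_patient, animate_agent, unit)
    into_patient = patch_effect(animate_agent, animate_patient, unit)
    print(f"unit {unit}: into animate_agent {into_agent:+.2f}, "
          f"into animate_patient {into_patient:+.2f}")
```

In this toy, both directions move the output by equal and opposite amounts, and the sign of each unit's effect separates a passive-suppressing unit from a passive-promoting one. Asymmetries of the kind the abstract reports would appear as unequal magnitudes between the two directions, and in GPT-2 Small the patched quantity would be an individual attention head's output captured with forward hooks rather than a hand-set hidden unit.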
Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.
Primary Area Selection: Computational Psycholinguistics, Cognition and Linguistics
Use Of Generative Artificial Intelligence Tools: Yes, for editing/proofreading the manuscript
Data Collection From Human Subjects: No
Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.
Submission Number: 176