Fine-grained Analysis of Brain-LLM Alignment through Input Attribution

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Brain alignment, Brain-LLM alignment, Input attribution, Next word prediction, Explainable artificial intelligence
TL;DR: Input attribution reveals why LLMs align with brain activity, highlighting both shared semantics and distinct mechanisms compared with next‑word prediction.
Abstract: Understanding the alignment between large language models (LLMs) and human brain activity can reveal computational principles underlying language processing. This work introduces a pipeline that applies input attribution methods to the brain-LLM alignment setting, identifying the specific words most important for this alignment. As a case study, we leverage it to examine a contentious research question about brain-LLM alignment: the relationship between brain alignment (BA) and next-word prediction (NWP). Across two naturalistic fMRI datasets, we find that BA and NWP rely on largely distinct word subsets: NWP exhibits recency and primacy biases with a focus on syntax, while BA prioritizes semantic and discourse-level information with a more targeted recency effect. This work advances our understanding of how LLMs relate to human language processing and highlights differences in feature reliance between BA and NWP. Beyond this study, our attribution method can be broadly applied to explore the cognitive relevance of model predictions in diverse language processing tasks.
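The core idea of such a pipeline can be illustrated with a minimal sketch (not the paper's released code). Everything concrete here is an assumption for illustration: GPT-2 as the LLM, a closed-form ridge map from hidden states to responses, a synthetic tensor standing in for real fMRI recordings, and gradient × input as the attribution method. The sketch shows how a differentiable brain-alignment score yields one importance value per input word.

```python
# Minimal sketch: gradient-x-input attribution of a brain-alignment score.
# GPT-2, the ridge mapping, and the synthetic "fMRI" data are illustrative
# placeholders, not the authors' actual setup.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

text = "The hunter followed the wounded deer into the silent forest."
ids = tok(text, return_tensors="pt").input_ids  # (1, T)

# Embed the tokens ourselves so gradients can flow back to individual words.
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
hidden = model(inputs_embeds=embeds).last_hidden_state[0]  # (T, d)

# Placeholder "fMRI" responses: one V-dim voxel vector per token.
T, d = hidden.shape
V = 50
torch.manual_seed(0)
brain = torch.randn(T, V)

# Ridge regression from LLM features to brain responses, fit on detached
# features (in the real setting this map would be fit on held-out data).
X = hidden.detach()
lam = 1.0
W = torch.linalg.solve(X.T @ X + lam * torch.eye(d), X.T @ brain)  # (d, V)

# Brain-alignment score: mean per-voxel Pearson correlation between
# predicted and observed responses, differentiable w.r.t. the input words.
pred = hidden @ W
pc = torch.nn.functional.cosine_similarity(
    pred - pred.mean(0), brain - brain.mean(0), dim=0
)  # cosine of mean-centered vectors = Pearson correlation, shape (V,)
score = pc.mean()

# Gradient x input: one attribution value per input token.
score.backward()
attr = (embeds.grad[0] * embeds[0].detach()).sum(-1)
for t, a in zip(tok.convert_ids_to_tokens(ids[0]), attr.tolist()):
    print(f"{t:>12s}  {a:+.4f}")
```

Substituting real fMRI responses, a held-out ridge fit, and an attribution method of choice (e.g., integrated gradients in place of gradient × input) would recover the setting the abstract describes; the comparison to NWP applies the same attribution machinery to the model's next-word loss instead of the alignment score.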
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Submission Number: 16744