Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Interpretability for Knowledge Discovery
Other Keywords: crosscoders, neuroscience
TL;DR: We decompose brain and language model representations into shared and specific features via crosscoders, finding that embodied semantics are brain-specific and colloquial expressions are LM-specific.
Abstract: To what extent do human brains and language models (LMs) share internal representations of language, and how do these representations differ? Prior work has shown that LM representations can predict brain responses to naturalistic language stimuli, suggesting that the two systems encode common information. However, which features are shared between brain and LM representations, and which are specific to one system, has remained underspecified. We propose Brain-LM crosscoders, which decompose brain responses and LM representations into a common set of sparse features and label each feature as shared, brain-specific, or LM-specific based on its predictive contribution to each representation. Experiments on fMRI data from naturalistic language listening show that language associated with body, family, and action tends to be brain-specific, whereas colloquial expressions tend to be LM-specific. Brain-LM crosscoders thus compare biological and artificial language representations at the feature level, which can contribute to scientific discovery in both neuroscience and artificial neural network research.
Submission Number: 644
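The decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the top-k sparsity rule, the use of relative decoder norm as a proxy for "predictive contribution", the thresholds 0.2 and 0.8, and all function and variable names below are assumptions made for illustration only.

```python
import numpy as np


def encode(brain, lm, W_brain, W_lm, b, k):
    """Map paired brain and LM inputs to one set of sparse feature activations.

    Assumption: sparsity is enforced by keeping the k largest ReLU
    activations per sample; the abstract does not specify the sparsity rule.
    """
    pre = brain @ W_brain + lm @ W_lm + b
    acts = np.maximum(pre, 0.0)
    n_features = acts.shape[1]
    if k < n_features:
        # Indices of the (n_features - k) smallest activations in each row.
        drop = np.argpartition(acts, -k, axis=1)[:, : n_features - k]
        np.put_along_axis(acts, drop, 0.0, axis=1)
    return acts


def decode(acts, D_brain, D_lm):
    """Reconstruct each representation with its own decoder."""
    return acts @ D_brain, acts @ D_lm


def label_features(D_brain, D_lm, low=0.2, high=0.8):
    """Label each feature as shared, brain-specific, or LM-specific.

    Assumption: the relative decoder norm stands in for a feature's
    predictive contribution; the thresholds are hypothetical.
    """
    norm_brain = np.linalg.norm(D_brain, axis=1)
    norm_lm = np.linalg.norm(D_lm, axis=1)
    ratio = norm_brain / (norm_brain + norm_lm + 1e-12)
    labels = np.where(
        ratio > high,
        "brain-specific",
        np.where(ratio < low, "LM-specific", "shared"),
    )
    return ratio, labels


# Toy decoders: 3 features, 2 brain dimensions, 3 LM dimensions.
D_brain = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
D_lm = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
ratio, labels = label_features(D_brain, D_lm)
```

In this toy setup, a feature whose decoder weights are nonzero only on the brain side receives a ratio near 1, and one that writes only to the LM side receives a ratio near 0. A contribution measure based on ablating each feature and measuring the change in prediction accuracy would be an alternative reading of the abstract's wording.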