Keywords: Methods (probing, steering, causal interventions)
Other Keywords: MLP activations, semantic relations
TL;DR: This paper examines how relation-specific neurons in language models are identified using contrastive activations and shows that conclusions about internal representations depend on methodological choices and prior assumptions.
Abstract: Previous work has identified relation-specific neurons that selectively activate on specific semantic relations in factual knowledge tasks. However, the conclusions we draw about these representations depend heavily on the methodological assumptions underlying the identification procedure. We systematically examine three such assumptions, showing that
(i) the number of relevant neurons varies across relations;
(ii) the choice of internal signal for neuron identification shapes the results;
(iii) cross-relation entanglement is structural rather than an artifact of subject overlap.
We additionally present a preliminary investigation into the mismatch between benchmark-defined relation categories and model-internal organization. For instance, we show that the absence of a strong expert set for the product_company relation reflects conceptual heterogeneity within the category rather than localization failure, and that targeted ablation of the subrelation car_company yields substantially stronger effects. Together, our findings show that the apparent structure of relational representations is jointly shaped by the model's internal organization and the methodological lens applied to study it.
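As a rough illustration of the general idea behind contrastive neuron identification and targeted ablation, the following is a minimal sketch on synthetic data. It is not the paper's procedure: the scoring rule (difference of mean activations), the fixed top-k cutoff, zero-ablation, and all function names are assumptions made for this example.

```python
import numpy as np

def contrastive_scores(pos_acts, neg_acts):
    """Mean activation on target-relation prompts minus mean on other prompts.

    Assumed scoring rule for illustration; other contrastive statistics are possible.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons, highest first."""
    return np.argsort(scores)[::-1][:k]

def ablate(acts, neuron_idx):
    """Return a copy of the activations with the selected neurons set to zero."""
    out = acts.copy()
    out[:, neuron_idx] = 0.0
    return out

# Synthetic stand-in for MLP activations: 50 prompts x 8 neurons per condition.
# Neurons 2 and 5 are constructed to respond to the (hypothetical) target relation.
rng = np.random.default_rng(0)
neg = rng.normal(0.0, 0.1, size=(50, 8))
pos = rng.normal(0.0, 0.1, size=(50, 8))
pos[:, [2, 5]] += 3.0

scores = contrastive_scores(pos, neg)
experts = top_k_neurons(scores, k=2)
ablated = ablate(pos, experts)
```

Because k is fixed here, the sketch also makes assumption (i) visible: a single cutoff applied to every relation would hide any variation in how many neurons are actually relevant.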
Submission Number: 432