Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans
Abstract: Modern large language models (LLMs) achieve impressive performance on some tasks while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well LLMs' learned representations align with human representations. In this work, we introduce a novel approach to studying representation alignment: we adopt a method from research on activation steering to identify the neurons responsible for specific concepts (e.g., "cat") and then analyze the corresponding activation patterns. We find that LLM representations captured this way align closely with human representations inferred from behavioral data, matching the level of inter-human alignment. Our approach captures significantly stronger alignment than word embeddings, which have been the focus of prior work on human-LLM alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts: we show that LLMs organize concepts in a way that mirrors human concept organization.
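To make the comparison described in the abstract concrete, here is a minimal sketch of a representational-similarity-style analysis: pairwise concept similarities computed from neuron activations are correlated with similarities inferred from human behavioral data. This is an illustrative assumption, not the paper's actual pipeline; the names `concept_activations`, `human_similarity`, and `alignment_score` are hypothetical placeholders, and the activation-steering-based neuron-selection step described in the abstract is not shown.

```python
# Hedged sketch (not the authors' exact method): correlate concept similarities
# derived from neuron activations with human-derived concept similarities.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr


def alignment_score(concept_activations: np.ndarray,
                    human_similarity: np.ndarray) -> float:
    """Spearman correlation between model-derived and human concept similarities.

    concept_activations: (num_concepts x num_neurons) activations of
        concept-selective neurons, assumed to be extracted beforehand.
    human_similarity: (num_concepts x num_concepts) similarity matrix
        inferred from behavioral data (placeholder input).
    """
    # Pairwise cosine similarity between concepts in neuron-activation space.
    model_similarity = 1.0 - squareform(pdist(concept_activations, metric="cosine"))
    # Compare only the upper triangles so each concept pair is counted once.
    iu = np.triu_indices_from(model_similarity, k=1)
    rho, _ = spearmanr(model_similarity[iu], human_similarity[iu])
    return rho


# Toy usage with random placeholder data (50 concepts, 300 selected neurons).
rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 300))
human = 1.0 - squareform(pdist(rng.normal(size=(50, 10)), metric="cosine"))
print(f"alignment (Spearman rho): {alignment_score(acts, human):.3f}")
```

The same scoring function could in principle be applied to word-embedding rows instead of neuron activations, which is one way the abstract's embedding-versus-neuron comparison could be operationalized; this is an assumption about the setup, not a statement of the paper's exact procedure.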
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: hierarchical & concept explanations; knowledge tracing/discovering/inducing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4797