Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans
Abstract: Modern large language models (LLMs) achieve impressive performance on some tasks while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well LLMs' learned representations align with human representations. In this work, we introduce a novel approach to studying representation alignment: we adopt a method from research on activation steering to identify the neurons responsible for specific concepts (e.g., "cat") and then analyze the corresponding activation patterns. We find that LLM representations captured this way align closely with human representations inferred from behavioral data, matching the level of inter-human alignment. Our approach captures significantly stronger alignment than word embeddings, which have been the focus of prior work on human-LLM alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts: we show that LLMs organize concepts in a way that mirrors human concept organization.
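To make the comparison described in the abstract concrete, here is a minimal sketch of a representational-similarity-style analysis: pairwise concept similarities computed from neuron activations are correlated with similarities inferred from human behavioral data. This is an illustrative assumption, not the paper's actual pipeline; the names `concept_activations`, `human_similarity`, and `alignment_score` are hypothetical placeholders, and the activation-steering-based neuron-selection step described in the abstract is not shown.

```python
# Hedged sketch (not the authors' exact method): correlate concept similarities
# derived from neuron activations with human-derived concept similarities.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr


def alignment_score(concept_activations: np.ndarray,
                    human_similarity: np.ndarray) -> float:
    """Spearman correlation between model-derived and human concept similarities.

    concept_activations: (num_concepts x num_neurons) activations of
        concept-selective neurons, assumed to be extracted beforehand.
    human_similarity: (num_concepts x num_concepts) similarity matrix
        inferred from behavioral data (placeholder input).
    """
    # Pairwise cosine similarity between concepts in neuron-activation space.
    model_similarity = 1.0 - squareform(pdist(concept_activations, metric="cosine"))
    # Compare only the upper triangles so each concept pair is counted once.
    iu = np.triu_indices_from(model_similarity, k=1)
    rho, _ = spearmanr(model_similarity[iu], human_similarity[iu])
    return rho


# Toy usage with random placeholder data (50 concepts, 300 selected neurons).
rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 300))
human = 1.0 - squareform(pdist(rng.normal(size=(50, 10)), metric="cosine"))
print(f"alignment (Spearman rho): {alignment_score(acts, human):.3f}")
```

The same scoring function could in principle be applied to word-embedding rows instead of neuron activations, which is one way the abstract's embedding-versus-neuron comparison could be operationalized; this is an assumption about the setup, not a statement of the paper's exact procedure.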
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: hierarchical & concept explanations; knowledge tracing/discovering/inducing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4797