Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans

Published: 23 Sept 2025 · Last Modified: 17 Nov 2025 · UniReps 2025 · CC BY-NC 4.0
Track: Extended Abstract Track
Keywords: concept representation, concept alignment, interpretability
TL;DR: We propose a novel way to study human-LLM alignment based on expert neurons, which provides more insight into LLM representations than embedding-based approaches
Abstract: Modern large language models (LLMs) achieve impressive performance on some tasks, while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well the LLM’s learned representations align with human representations. In this work, we introduce a novel approach to study representation alignment: we adopt an activation steering method to identify neurons responsible for specific concepts (e.g., “cat”) and then analyze the corresponding activation patterns. We find that LLM representations captured this way closely align with human representations inferred from behavioral data, matching inter-human alignment levels. Our approach significantly outperforms the alignment captured by word/sentence embeddings, which have been the focus of prior work on human-LLM alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts: we show that LLMs organize concepts in a way that mirrors human concept organization.
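
To make the comparison described in the abstract concrete, below is a minimal sketch of how an alignment score of this kind could be computed, assuming one has a concept-by-neuron activation matrix (from expert neurons identified for each concept) and a human similarity matrix (inferred from behavioral judgments). The synthetic data, the cosine similarity measure, and the Spearman-based RSA-style comparison are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch (not the authors' code): compare concept similarity derived from
# per-concept neuron activation patterns against human similarity judgments.
# All inputs here are synthetic placeholders; in the paper, activations would come
# from expert neurons found via an activation steering method, and human
# similarities from behavioral data.

import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)

concepts = ["cat", "dog", "car", "apple"]  # example concept set
n_neurons = 512                            # hypothetical number of expert neurons

# Hypothetical concept-by-neuron activation matrix: row i holds the activation
# pattern associated with concept i.
llm_activations = rng.normal(size=(len(concepts), n_neurons))

# Hypothetical human similarity matrix (e.g., from pairwise similarity judgments),
# symmetric with unit diagonal.
human_sim = np.array([
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.3],
    [0.2, 0.2, 0.3, 1.0],
])

# LLM concept similarity: cosine similarity between activation patterns.
llm_sim = 1.0 - squareform(pdist(llm_activations, metric="cosine"))

# Alignment score: Spearman correlation between the off-diagonal entries of the
# two similarity matrices (a standard representational-similarity comparison).
iu = np.triu_indices(len(concepts), k=1)
alignment, _ = spearmanr(llm_sim[iu], human_sim[iu])
print(f"human-LLM representational alignment (Spearman rho): {alignment:.3f}")
```

The same scoring could be run on similarities computed from word or sentence embeddings, giving the embedding baseline that the expert-neuron representations are compared against.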
Submission Number: 56