Keywords: LLM Interpretability, Semantic Directions, Sparse Representation, Token Embeddings, Linear Probes, Brain-Inspired AI, fMRI Alignment
Abstract: Linear probes reveal that semantic properties are linearly separable in LLM hidden representations, yet where these semantic directions originate and what they are composed of remain underexplored. Inspired by sparse population coding in the brain---where a concept is represented by activating only $\sim$1\% of neurons---we propose Token-Mixture Representation, which expresses semantic directions as sparse, non-negative linear combinations of LM head token embeddings. Applying a global Top-$K$ constraint decomposes each semantic direction into a human-readable token list. Across four LLMs (Llama-3.1-8B, Qwen2.5-7B, GPT-OSS-20B, Phi-4) and four semantic classification tasks, we find that using only $K{=}100$ tokens (0.1\% of the vocabulary) achieves 96\% of full-vocabulary performance on average. The selected tokens are highly interpretable, and knock-out experiments confirm that they contribute 3--8$\times$ more to performance than random tokens. Furthermore, Token-Mixture directions align significantly with Sparse Autoencoder monosemantic features ($z = 12.32$, $p < 0.001$), confirming that they capture genuine semantic structure rather than superficial lexical patterns. Finally, an fMRI alignment analysis achieves a meta-correlation of 0.76 with human brain concreteness-sensitivity patterns, with strong alignment in known semantic processing regions. These findings demonstrate that LLM semantic directions are represented as sparsely in vocabulary space as concepts are in the brain, opening new possibilities for interpretable representation engineering.
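To make the sparse, non-negative Top-$K$ decomposition described above concrete, here is a minimal sketch. It is an illustrative assumption, not the paper's actual algorithm: the function name, the choice of dot-product scores as mixture coefficients, and the renormalization step are all hypothetical simplifications.

```python
import numpy as np

def token_mixture_direction(direction, embeddings, k=100):
    """Illustrative sketch: approximate a semantic direction as a
    sparse, non-negative mixture of token embeddings.

    direction:  (d,) probe direction in hidden space
    embeddings: (V, d) LM-head token embedding matrix
    k:          number of tokens kept (global Top-K constraint)
    Returns the (V,) sparse coefficient vector and the unit-norm
    reconstruction of the direction.
    """
    # Score each token embedding against the direction.
    scores = embeddings @ direction            # (V,)
    # Non-negativity: drop tokens pointing away from the direction.
    scores = np.clip(scores, 0.0, None)
    # Global Top-K: zero out all but the k largest coefficients.
    if k < scores.size:
        cutoff = np.partition(scores, -k)[-k]
        scores = np.where(scores >= cutoff, scores, 0.0)
    # Reconstruct the direction from the sparse token mixture.
    recon = scores @ embeddings                # (d,)
    norm = np.linalg.norm(recon)
    return scores, recon / norm if norm > 0 else recon
```

The nonzero entries of the returned coefficient vector index the human-readable token list; a faithful implementation would likely fit the coefficients with a constrained regression rather than raw dot-product scores.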
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, interpretability, representation learning, linear probes, sparse representations, model analysis, semantic representations
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, Chinese, Spanish, French, German
Submission Number: 3698