Keywords: LLM Interpretability, Semantic Directions, Sparse Representation, Token Embeddings, Linear Probes, Brain-Inspired AI, fMRI Alignment
Abstract: Linear probes reveal that semantic properties are linearly separable in LLM hidden representations, yet where these semantic directions originate and what they are composed of remain underexplored. Inspired by sparse population coding in the brain---where a concept is represented by activating only $\sim$1\% of neurons---we propose Token-Mixture Representation, which expresses semantic directions as sparse, non-negative linear combinations of LM head token embeddings. Applying a global Top-$K$ constraint decomposes each semantic direction into a human-readable token list. Across four LLMs (Llama-3.1-8B, Qwen2.5-7B, GPT-OSS-20B, Phi-4) and four semantic classification tasks, we find that using only $K{=}100$ tokens (0.1\% of the vocabulary) achieves 96\% of full-vocabulary performance on average. The selected tokens are highly interpretable, and knock-out experiments confirm that they contribute 3--8$\times$ more to performance than random tokens. Furthermore, Token-Mixture directions align significantly with Sparse Autoencoder monosemantic features ($z = 12.32$, $p < 0.001$), confirming that they capture genuine semantic structure rather than superficial lexical patterns. Finally, an fMRI alignment analysis achieves a meta-correlation of 0.76 with human brain concreteness-sensitivity patterns, with strong alignment in known semantic processing regions. These findings demonstrate that LLM semantic directions are represented as sparsely in vocabulary space as concepts are in the brain, opening new possibilities for interpretable representation engineering.
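To make the sparse, non-negative Top-$K$ decomposition described above concrete, here is a minimal sketch. It is an illustrative assumption, not the paper's actual algorithm: the function name, the choice of dot-product scores as mixture coefficients, and the renormalization step are all hypothetical simplifications.

```python
import numpy as np

def token_mixture_direction(direction, embeddings, k=100):
    """Illustrative sketch: approximate a semantic direction as a
    sparse, non-negative mixture of token embeddings.

    direction:  (d,) probe direction in hidden space
    embeddings: (V, d) LM-head token embedding matrix
    k:          number of tokens kept (global Top-K constraint)
    Returns the (V,) sparse coefficient vector and the unit-norm
    reconstruction of the direction.
    """
    # Score each token embedding against the direction.
    scores = embeddings @ direction            # (V,)
    # Non-negativity: drop tokens pointing away from the direction.
    scores = np.clip(scores, 0.0, None)
    # Global Top-K: zero out all but the k largest coefficients.
    if k < scores.size:
        cutoff = np.partition(scores, -k)[-k]
        scores = np.where(scores >= cutoff, scores, 0.0)
    # Reconstruct the direction from the sparse token mixture.
    recon = scores @ embeddings                # (d,)
    norm = np.linalg.norm(recon)
    return scores, recon / norm if norm > 0 else recon
```

The nonzero entries of the returned coefficient vector index the human-readable token list; a faithful implementation would likely fit the coefficients with a constrained regression rather than raw dot-product scores.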
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, interpretability, representation learning, linear probes, sparse representations, model analysis, semantic representations
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, Chinese, Spanish, French, German
Submission Number: 3698