Self-Interpretable Concept Representations: Training Lightweight Adapters on Vector-Label Pairs

Published: 01 Mar 2026, Last Modified: 01 Mar 2026 · UCRL@ICLR 2026 Poster · CC BY 4.0
Keywords: self-interpretation, mechanistic interpretability, sparse autoencoders, activation patching, Patchscopes, implicit reasoning
TL;DR: Train a tiny adapter on labeled activation vectors or SAE labels to make frozen language models reliably describe their own activations.
Abstract: Self-interpretation methods prompt language models to describe their own internal states, offering a path toward concept-based self-explanation, but they remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on learned concept representations, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder concept labels that outperform the training labels themselves on generation scoring, a concept-quality metric (71% vs. 63% at 70B scale); identify topics with 94% recall@1 versus 1% for untrained baselines; and surface semantic concepts implicit in multi-hop reasoning, including bridge entities that appear in neither prompt nor response, without chain-of-thought. The learned bias vector alone accounts for 85% of the improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find that self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that faithful self-explanation of learned concepts improves with scale, without modifying the model being interpreted.
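The sketch below illustrates the kind of adapter the abstract describes: a scalar affine map with $d_\text{model}+1$ trainable parameters (one scalar scale plus a $d_\text{model}$-sized bias), trained on (activation vector, concept label) pairs while the interpreted LM stays frozen. This is not the authors' released code; the class name `ScalarAffineAdapter` and the stand-in `frozen_lm_label_loss` are hypothetical, and the actual patching and label-scoring against the frozen LM (e.g. Patchscopes-style injection into a self-interpretation prompt) is abstracted away.

```python
# Minimal sketch, assuming a PyTorch setup; the frozen-LM scoring step is stubbed out.
import torch
import torch.nn as nn


class ScalarAffineAdapter(nn.Module):
    """Maps an activation h to scale * h + bias: d_model + 1 parameters total."""

    def __init__(self, d_model: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))        # 1 scalar parameter
        self.bias = nn.Parameter(torch.zeros(d_model))  # d_model bias parameters

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.scale * h + self.bias


def frozen_lm_label_loss(patched_vec: torch.Tensor, label: str) -> torch.Tensor:
    """Hypothetical placeholder for the cross-entropy of the frozen LM generating
    `label` when `patched_vec` is injected into a self-interpretation prompt.
    A dummy differentiable loss is used here so the sketch runs standalone."""
    return patched_vec.pow(2).mean()


d_model = 4096
adapter = ScalarAffineAdapter(d_model)
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

# Toy vector-label pairs standing in for SAE concept directions and their labels.
pairs = [(torch.randn(d_model), "references to maritime navigation")]

for activation, label in pairs:
    optimizer.zero_grad()
    loss = frozen_lm_label_loss(adapter(activation), label)
    loss.backward()   # gradients reach only the adapter's d_model + 1 parameters
    optimizer.step()
```

In this setup only the adapter is optimized, which matches the abstract's claim that the model being interpreted is never modified; the bias vector carries most of the capacity, consistent with the reported ablation.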
Submission Number: 34