Open-Vocabulary Natural-Language Explanations of LLM Activations via Soft Prompts

Published: 30 Sept 2025, Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Sparse Autoencoders, Automated interpretability, Probing
TL;DR: L2EL maps activations to soft prompts so a frozen LLM emits SAE explanations
Abstract: We introduce Latent-to-Explanation Likelihood (L2EL), a simple interface that translates internal activations of a large language model (LLM) into a short natural-language explanation without changing the underlying LLM. Given a single hidden representation (e.g., a residual-stream activation at one token and layer), a tiny mapper produces a continuous "soft prompt" that conditions a frozen LLM to generate an explanation. We train the mapper with weak supervision from sparse autoencoder (SAE) explanations: for each latent we sample one natural-language feature description from among the active SAE features and optimize the soft prompt so that the LLM emits that description when conditioned on it. At test time, L2EL supports (i) generation of concise free-form explanations and (ii) probing by scoring arbitrary hypotheses through their conditional likelihood. This reframes interpretability as conditional language modeling over explanations, enabling an open vocabulary and calibration through likelihoods. As a proof of concept, we train L2EL on Gemma-2-2B using GemmaScope SAEs. Our results indicate that L2EL generates reasonable explanations and can be used to probe hidden activations using natural language. L2EL preserves the strengths of language as an expressive medium while requiring only a small learned interface and no modifications to the LLM.
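To make the training setup concrete, below is a minimal PyTorch sketch of the L2EL interface under stated assumptions: the names `L2ELMapper` and `l2el_nll`, the two-layer MLP mapper, and the number of soft-prompt tokens `k_prompt` are illustrative choices, not taken from the paper, and `frozen_lm` is assumed to be a HuggingFace-style causal LM (e.g., Gemma-2-2B) whose parameters require no gradients and that accepts `inputs_embeds`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2ELMapper(nn.Module):
    """Tiny mapper: a single hidden activation -> k soft-prompt embeddings.

    The two-layer MLP architecture here is an assumption; the abstract
    only states that the mapper is small and the LLM stays frozen.
    """

    def __init__(self, d_act: int, d_model: int, k_prompt: int = 8):
        super().__init__()
        self.k_prompt, self.d_model = k_prompt, d_model
        self.net = nn.Sequential(
            nn.Linear(d_act, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, k_prompt * d_model),
        )

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # activation: (batch, d_act) -> soft prompt: (batch, k_prompt, d_model)
        return self.net(activation).view(-1, self.k_prompt, self.d_model)


def l2el_nll(mapper, frozen_lm, embed_fn, activation, explanation_ids):
    """Negative log-likelihood of a sampled SAE feature description,
    conditioned on the soft prompt produced from one hidden activation."""
    soft_prompt = mapper(activation)          # (B, k, d_model)
    expl_embeds = embed_fn(explanation_ids)   # (B, T, d_model)
    inputs = torch.cat([soft_prompt, expl_embeds], dim=1)
    logits = frozen_lm(inputs_embeds=inputs).logits
    # Logits at position i predict the token at position i + 1, so the
    # explanation tokens (positions k .. k+T-1) are predicted from the
    # logits at positions k-1 .. k+T-2.
    k, T = soft_prompt.size(1), explanation_ids.size(1)
    pred = logits[:, k - 1 : k - 1 + T, :]
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)), explanation_ids.reshape(-1)
    )
```

The same quantity supports the probing use described in the abstract: at test time, tokenize a candidate hypothesis, evaluate `l2el_nll` without gradients, and compare the resulting conditional likelihoods across hypotheses.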
Submission Number: 156