Keywords: Applications of interpretability, Probing, Other
TL;DR: We stress-test techniques for decoding activations into natural language and find a lack of proper evaluations and baselines. Verbalization methods give an illusion of interpretable model mechanisms rather than revealing privileged knowledge.
Abstract: Several recent interpretability methods have proposed to convert a target LLM's internal representations into natural language descriptions using a second LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such "activation verbalization" approaches actually provide $\textit{privileged}$ knowledge about the internal workings of the target model, or do they merely convey information about the input prompt given to it? We critically evaluate previously proposed verbalization methods across datasets used in prior work and find that one can achieve strong performance without any access to target model internals. This suggests that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that generated descriptions often reflect the parametric knowledge of the LLM used to generate them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for more focused tasks and experimental controls to rigorously assess whether verbalization provides meaningful insights into the operations of LLMs.
Submission Number: 145