Keywords: Applications of interpretability, Probing, Other
TL;DR: We stress-test techniques for decoding activations into natural language and find a lack of proper evaluations and baselines. Verbalization methods give an illusion of interpretable model mechanisms rather than revealing privileged knowledge.
Abstract: Several recent interpretability methods have proposed to convert a target LLM's internal representations into natural language descriptions using a second LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such "activation verbalization" approaches actually provide $\textit{privileged}$ knowledge about the internal workings of the target model, or do they merely convey information about the input prompt given to it? We critically evaluate previously proposed verbalization methods across datasets used in prior work and find that one can achieve strong performance without any access to target model internals. This suggests that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that generated descriptions often reflect the parametric knowledge of the LLM used to generate them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for more focused tasks and experimental controls to rigorously assess whether verbalization provides meaningful insights into the operations of LLMs.
Submission Number: 145