Teaching LLMs to Decode Activations Into Natural Language

Alexander Pan; Lijie Chen; Jacob Steinhardt

28 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: llm safety, llm interpretability, ml safety, activation steering

TL;DR: We train an LLM to answer open-ended questions about model latents and use it for interpretability and control

Abstract: Interpretability methods seek to understand language model representations, yet the outputs of most such methods (circuits, vectors, scalars) are not themselves readable and require further effort to interpret. In contrast, we propose to study LatentQA, the task of answering open-ended questions about model activations in natural language. To address LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, for example to debias models on stereotyped sentences and to control the sentiment of generations. Finally, we extend LatentQA to reveal harmful model capabilities, such as generating recipes for bioweapons and code for hacking.
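
The control idea in the abstract (a decoder that defines a differentiable loss over activations) can be illustrated with a toy sketch. This is not the authors' implementation: the "decoder" here is a hypothetical frozen linear readout over a two-dimensional activation, the answer labels are invented, and the names `decoder_logits`, `answer_loss`, and `steer` are assumptions made only for illustration. The sketch shows the general pattern of lowering the loss of a desired answer by gradient descent on the activation while the decoder stays fixed.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_logits(weights, activation):
    # Toy stand-in for a decoder: one linear readout row per candidate answer.
    return [sum(w * a for w, a in zip(row, activation)) for row in weights]

def answer_loss(weights, activation, target):
    # Cross-entropy of the desired answer under the frozen toy decoder.
    return -math.log(softmax(decoder_logits(weights, activation))[target])

def steer(weights, activation, target, lr=0.5, steps=50):
    # Gradient descent on the activation only; the decoder weights never change.
    act = list(activation)
    for _ in range(steps):
        probs = softmax(decoder_logits(weights, act))
        # d(loss)/d(logit_k) = probs[k] - 1 if k is the target, else probs[k]
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Chain rule through the linear readout.
        grad = [sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
                for j in range(len(act))]
        act = [a - lr * g for a, g in zip(act, grad)]
    return act

# Hypothetical readout: answer 0 = "negative", answer 1 = "positive".
WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]
START = [2.0, 0.0]  # activation that the toy decoder initially reads as answer 0
steered = steer(WEIGHTS, START, target=1)
```

In this toy setting the steered activation is read as answer 1 with higher probability than the starting activation. A real system would replace the linear readout with a finetuned decoder LLM and backpropagate through it, which this sketch does not attempt.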

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 12589
