LatentQA: Teaching LLMs to Decode Activations Into Natural Language

ICLR 2026 Conference Submission 13761 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: AI Safety, Activation Engineering, Top-Down Transparency of Language Models
Abstract: Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language and perform LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a pseudo-labeled dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder’s fidelity by assessing its ability to read and steer model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that the decoder is precise enough to steer the target model to exhibit behaviors unseen during training. Finally, we show that LatentQA scales well with increasing dataset and model size, which is promising given how easily our approach can generate additional pseudo-labels.
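The abstract describes the method only at a high level. As a rough illustration, the sketch below shows one way a decoder LLM could be fine-tuned on (activation, question, answer) triples: activations are read from a frozen target model, prepended to the decoder's input as pseudo-token embeddings, and the decoder is trained to produce the answer. The function `latentqa_step`, the layer index, and all variable names are illustrative assumptions, not the authors' implementation; it also assumes the target and decoder share a tokenizer and hidden size (e.g., both are initialized from the same base model).

```python
# Minimal sketch of a LatentQA-style training step (illustrative, not the paper's code).
import torch


def latentqa_step(target_model, decoder, tokenizer, prompt, question, answer,
                  layer=15, optimizer=None):
    """One fine-tuning step: decode target-model activations into an answer."""
    # 1. Run the frozen target model on the prompt and grab activations at one layer.
    with torch.no_grad():
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        hidden = target_model(
            prompt_ids, output_hidden_states=True
        ).hidden_states[layer]  # shape: (1, prompt_len, d_model)

    # 2. Embed the question + answer text with the decoder's own embedding table.
    qa_ids = tokenizer(question + answer, return_tensors="pt").input_ids
    qa_embeds = decoder.get_input_embeddings()(qa_ids)

    # 3. Prepend the activations as pseudo-tokens; supervise only the QA tokens
    #    (-100 masks the activation positions out of the loss; in practice one
    #    would likely mask the question tokens as well).
    inputs = torch.cat([hidden, qa_embeds], dim=1)
    labels = torch.cat(
        [torch.full(hidden.shape[:2], -100, dtype=torch.long), qa_ids], dim=1
    )
    loss = decoder(inputs_embeds=inputs, labels=labels).loss

    # 4. Update only the decoder; the target model saw no gradients (no_grad above).
    if optimizer is not None:
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
```

Because the question-answer pairs are generated automatically (pseudo-labels), a loop over such steps can be scaled simply by producing more triples, which is the scaling property the abstract highlights.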
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13761