Mechanistically Eliciting Latent Behaviors in Language Models

ICLR 2026 Conference Submission 19924 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: mechanistic interpretability, steering vectors, tensor decompositions
Abstract: We aim to discover diverse, generalizable perturbations of LLM internals that can surface hidden behavioral modes. Such perturbations could help reshape model behavior and systematically evaluate potential risks. We introduce \emph{Deep Causal Transcoding} (DCT), an unsupervised method for discovering interpretable steering vectors that elicit these latent behaviors. DCTs learn a shallow MLP approximation of a deep transformer slice using a heuristic generalization of existing tensor decomposition algorithms. DCTs exhibit remarkable data efficiency, learning a large number of interpretable features from a \emph{single example}. We document empirical \emph{enumerative scaling laws}, finding that DCTs enumerate natural behaviors more efficiently than random steering vectors do. We show that DCT vectors increase the variety of behaviors elicited by open-ended conversational prompts, and even lead to moderately more sample-efficient exploration on reasoning problems, improving pass@2048 accuracy by 4\% on AIME25 with DeepSeek-R1-Distill-Qwen-14B. We also demonstrate partial overlap with sparse autoencoder (SAE) features, providing an external source of evidence for the validity of our feature-detection method. By offering a data-efficient way to systematically explore the space of latent model behaviors, DCTs provide a powerful tool for aligning AI systems and evaluating their safety.
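To make the core idea concrete, below is a minimal, hypothetical sketch of the kind of setup the abstract describes: fitting a shallow one-hidden-layer MLP to approximate the input-to-output map of a deep transformer "slice," then reading candidate steering vectors off the learned input weights. The shapes, synthetic data, MSE objective, and variable names here are illustrative assumptions for exposition, not the paper's actual DCT algorithm or tensor-decomposition procedure.

```python
# Hypothetical sketch: approximate a deep transformer slice with a shallow MLP
# and treat the learned input directions as candidate steering vectors.
import torch

d_model, n_features, n_samples = 256, 64, 512

# Stand-ins for source-layer activations and the slice's output change;
# in practice these would be collected from a real transformer.
h_src = torch.randn(n_samples, d_model)
slice_map = torch.nn.Sequential(          # unknown deep slice to approximate
    torch.nn.Linear(d_model, d_model), torch.nn.GELU(),
    torch.nn.Linear(d_model, d_model),
)
with torch.no_grad():
    delta_h = slice_map(h_src)

# Shallow approximation: delta_h ≈ W_out(gelu(W_in(h)))
W_in = torch.nn.Linear(d_model, n_features, bias=False)
W_out = torch.nn.Linear(n_features, d_model, bias=False)
opt = torch.optim.Adam(list(W_in.parameters()) + list(W_out.parameters()), lr=1e-3)

for step in range(2000):
    pred = W_out(torch.nn.functional.gelu(W_in(h_src)))
    loss = torch.nn.functional.mse_loss(pred, delta_h)
    opt.zero_grad(); loss.backward(); opt.step()

# Unit-normalized rows of W_in serve as candidate steering vectors: adding a
# scaled vector to the residual stream at the slice's input perturbs its output
# along a (hopefully interpretable) direction.
steering_vectors = torch.nn.functional.normalize(W_in.weight.detach(), dim=-1)
steered = h_src + 8.0 * steering_vectors[0]   # apply one candidate direction
```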
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 19924