Identifying latent algorithms in Transformers via RASP

Published: 11 Jun 2026, Last Modified: 16 Jun 2026. Mech Interp Workshop ICML 2026, Virtual poster. License: CC BY 4.0
Keywords: Methods (probing, steering, causal interventions), Concept Discovery (e.g., SAEs, dictionary learning)
TL;DR: Transformers trained on algorithmic tasks exhibit representations of RASP computations
Abstract: Transformers trained on certain algorithmic tasks have been found to generalize to out-of-distribution (OOD) examples. In this work, we identify several causal mechanisms ("latent algorithms") that are responsible for this OOD generalization. Specifically, given an algorithmic task, we take a RASP-L program that implements the task and use its intermediate computations as probe targets, testing how well the model's learned representations align with the program. For several algorithms, the intermediate computations predicted by the RASP-L program are linearly decodable from the model's activations. Causal intervention analysis shows that the probe subspaces are crucial for high task accuracy. Overall, this work offers a new perspective on the hidden computations and OOD generalization of Transformer language models.
Submission Number: 123
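
The probing and intervention steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the activations, the "intermediate value", the readout head, and every name below are hypothetical, synthetic stand-ins chosen only to show the general shape of a linear probe followed by a subspace ablation with a random-direction control.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: `target` stands in for an intermediate value of a
# RASP-L program, and `acts` for model activations that encode it linearly
# along one direction plus isotropic noise. All data here is synthetic.
n, d = 4000, 32
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
target = rng.normal(size=n)
acts = rng.normal(size=(n, d)) + 3.0 * np.outer(target, direction)
train, test = slice(0, 3000), slice(3000, n)

def fit_linear_probe(x, y):
    """Least-squares linear probe with a bias term; returns (weights, bias)."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    return coef[:-1], coef[-1]

def r_squared(y, pred):
    """Coefficient of determination of predictions against targets."""
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def ablate_direction(x, v):
    """Project activations onto the orthogonal complement of direction v."""
    u = v / np.linalg.norm(v)
    return x - np.outer(x @ u, u)

def task_accuracy(x, y):
    """Toy downstream readout (an assumption): a fixed linear head along
    `direction` predicts the sign of the intermediate value."""
    return np.mean((x @ direction > 0) == (y > 0))

# Step 1: linear decodability of the intermediate value on held-out data.
w, b = fit_linear_probe(acts[train], target[train])
r2 = r_squared(target[test], acts[test] @ w + b)

# Step 2: causal intervention. Remove the probe direction and compare
# against removing a random direction of the same dimensionality.
acc_before = task_accuracy(acts[test], target[test])
acc_after = task_accuracy(ablate_direction(acts[test], w), target[test])
random_dir = rng.normal(size=d)
acc_control = task_accuracy(ablate_direction(acts[test], random_dir), target[test])

print(f"probe R^2: {r2:.3f}")
print(f"accuracy before / probe-ablated / random-ablated: "
      f"{acc_before:.3f} / {acc_after:.3f} / {acc_control:.3f}")
```

In this toy setting, ablating the probe direction should hurt the readout far more than ablating a random direction, which is the kind of contrast a causal intervention analysis looks for. A real analysis would use actual model activations, probe subspaces of possibly more than one dimension, and the model's own task accuracy.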