Tracr: Compiled Transformers as a Laboratory for Interpretability

David Lindner; Janos Kramar; Sebastian Farquhar; Matthew Rahtz; Thomas McGrath; Vladimir Mikulik

Published: 21 Sept 2023, Last Modified: 02 Nov 2023, NeurIPS 2023 spotlight

Keywords: interpretability, transformers, language models, RASP, Tracr, mechanistic interpretability

TL;DR: Compiling human-readable programs into weights of a transformer model to accelerate interpretability research.

Abstract: We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as _ground-truth_ for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown, it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
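
As a rough illustration of the kind of human-readable program the abstract refers to, the following is a minimal plain-Python sketch of a RASP-style token-frequency program. The helper names `select`, `selector_width`, and `token_frequencies` are hypothetical and written only for this illustration. They loosely mimic RASP's select and selector-width primitives, but they are not Tracr's API, and nothing here compiles to transformer weights.

```python
from typing import Callable, List, Sequence


def select(
    keys: Sequence,
    queries: Sequence,
    predicate: Callable[[object, object], bool],
) -> List[List[bool]]:
    """Build a boolean matrix: entry [q][k] is predicate(keys[k], queries[q]).

    This plays the role of an attention pattern in a RASP-style program
    (an assumption of this sketch, not a Tracr function).
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: List[List[bool]]) -> List[int]:
    """For each query position, count how many key positions it selects."""
    return [sum(row) for row in selector]


def token_frequencies(tokens: Sequence) -> List[int]:
    """For each position, count how often that position's token occurs."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


if __name__ == "__main__":
    print(token_frequencies(list("hello")))  # [1, 1, 2, 2, 1]
```

In this sketch, each position "attends" to every position holding the same token, and the width of that selection is the token's frequency. A compiler such as Tracr would map a program of this shape onto attention and MLP layers; that step is not shown here.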

Supplementary Material: zip

Submission Number: 12093