Compiling to transformers

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 Virtual poster | CC BY 4.0
Keywords: Benchmarking Interpretability, Interpretability tooling and software
Other Keywords: Programming Language Theory, Type Theory, Compiler Correctness
TL;DR: We formally specify and prove correct a compiler from discrete algorithms to transformers.
Abstract: The representations of transformers are easy to measure but hard to understand; we cannot reliably predict or control behavior from them. To address this, recent work has proposed compiling high-level programs to transformers as a tool for studying these representations. Because the algorithm driving a compiled transformer's representations is known, techniques for understanding those representations can be verified against a ground truth. However, prior compilers make a closed-program assumption: they cannot compile programs with free variables. As a consequence, the program must specify the entire algorithm a transformer implements, which restricts applicability to hand-constructed transformers rather than ones trained by gradient descent. We resolve this limitation with Cajal, a language whose programs compile to transformers that are subsequently trained by gradient descent. We prove its compiler correct and conduct experiments in which we compile a conditional reverse program into the attention head of a transformer, showing that transformers can learn to use compiled programs that specify part of their behavior. Importantly, this enables verification of interpretability techniques against partial ground truths in transformers trained by gradient descent.
Submission Number: 173