How Do Transformers Learn Variable Binding in Symbolic Programs?

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 poster | CC BY 4.0
TL;DR: A developmental interpretability study of a small Transformer trained to perform variable binding in symbolic programs.
Abstract: Variable binding---the ability to associate variables with values---is fundamental to symbolic computation and cognition. Although classical architectures typically implement variable binding via addressable memory, it is not well understood how modern neural networks lacking built-in binding operations may acquire this capacity. We investigate this by training a Transformer to dereference queried variables in symbolic programs where variables are assigned either numerical constants or other variables. Each program requires following chains of variable assignments up to four steps deep to find the queried value, and also contains irrelevant chains of assignments acting as distractors. Our analysis reveals a developmental trajectory with three distinct phases during training: (1) random prediction of numerical constants, (2) a shallow heuristic prioritizing early variable assignments, and (3) the emergence of a systematic mechanism for dereferencing assignment chains. Using causal interventions, we find that the model learns to exploit the residual stream as an addressable memory space, with specialized attention heads routing information across token positions. This mechanism allows the model to dynamically track variable bindings across layers, resulting in accurate dereferencing. Our results show how Transformer models can learn to implement systematic variable binding without explicit architectural support, bridging connectionist and symbolic approaches.
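To make the task setup concrete, here is a minimal Python sketch of a program generator and ground-truth dereferencer in the spirit of the abstract. The function names, value range, and distractor scheme are our own illustrative assumptions, not the authors' data pipeline.

```python
import random
import string

def make_program(chain_depth=4, n_distractors=2, rng=random):
    """Hypothetical sketch of one training example: a single referential
    chain plus distractor assignments, ending with a query for the
    variable at the end of the chain."""
    names = iter(string.ascii_lowercase)
    lines = []

    # Relevant chain, e.g. a = 7, b = a, c = b, d = c (up to four steps deep).
    root = next(names)
    lines.append(f"{root} = {rng.randint(0, 9)}")
    chain = [root]
    for _ in range(chain_depth - 1):
        var = next(names)
        lines.append(f"{var} = {chain[-1]}")
        chain.append(var)

    # Irrelevant assignments that never feed into the queried variable.
    for _ in range(n_distractors):
        var = next(names)
        lines.append(f"{var} = {rng.randint(0, 9)}")

    rng.shuffle(lines)  # each variable is bound exactly once, so order is free
    return "\n".join(lines) + f"\n{chain[-1]} = ?"

def dereference(program):
    """Ground truth: follow assignments until a numerical constant is reached."""
    *assigns, query = program.splitlines()
    env = dict(tuple(s.strip() for s in line.split("=")) for line in assigns)
    target = query.split("=")[0].strip()
    while not target.isdigit():
        target = env[target]
    return int(target)

prog = make_program()
print(prog)
print("answer:", dereference(prog))
```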
Lay Summary: Variable binding is a fundamental operation in cognition that involves associating abstract placeholders with specific values, like linking $x$ to $5$ in a math problem. Classical computers do this by storing variables and their values in explicit memory, but it is unclear how modern neural networks, which lack built-in symbolic memory, achieve it. We investigated whether the Transformer, a popular neural network architecture, can learn on its own to bind variables to values. We trained a Transformer to solve synthetic puzzles that require tracking chains of variable assignments. For example, given "$x = 5, y = x, z = y$; what is $z$?", the model must follow the chain to determine that $z$ equals $5$. Our experiments reveal that the model learns in three stages: first guessing randomly, then using simple shortcuts, and finally learning a systematic method. By analyzing its inner mechanisms, we found that it repurposes parts of its structure as memory, dynamically passing information through specialized pathways. This finding shows that neural networks can spontaneously develop structured operations similar to those of classical symbolic systems, offering valuable insights into how advanced AI models acquire complex problem-solving skills. To help researchers explore these findings, we developed Variable Scope, an interactive web platform showcasing our experimental results.
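The chain-following step in the example above can be written out directly; this tiny sketch is our own illustration of the dereferencing behavior the model must learn, not the authors' code.

```python
# The example chain from the lay summary: z -> y -> x -> 5.
env = {"x": "5", "y": "x", "z": "y"}

val = "z"
while val in env:   # keep following bindings until we reach a constant
    val = env[val]

print(val)  # 5
```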
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: variable binding, mechanistic interpretability, causal interventions, transformers, language models
Submission Number: 14043