Keywords: Language models, Graph algorithms, Interpretability
TL;DR: We train 2-layer transformers to predict shortest paths on simple connected graphs and reverse-engineer the algorithm they learn.
Abstract: Decoder-only transformers have driven a step-change in the capabilities of large language models. However, opinions are mixed as to whether these models are truly planning or reasoning. One path to progress is to study a model's behavior in a setting with carefully controlled data, and then to interpret the learned representations and reverse-engineer the computation performed internally. We study decoder-only transformer language models trained from scratch to predict shortest paths on simple, connected, undirected graphs. In this setting, the representations and the algorithm learned by the model are completely interpretable. We present three major results: (1) Two-layer decoder-only language models can learn to predict shortest paths on simple, connected graphs containing up to $10$ nodes. (2) Models learn a graph embedding that is correlated with the spectral decomposition of the \emph{line graph}. (3) A new approximate path-finding algorithm, \emph{Spectral Line Navigator}, that computes shortest paths using the spectral decomposition of the line graph.
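As a rough illustration of the line-graph spectral idea referenced in the abstract (not the authors' implementation, and the choice of Laplacian eigenvectors and all graph parameters here are assumptions), the sketch below builds the line graph of a small random graph with networkx and computes a spectral embedding of its edges:

```python
# Minimal sketch: spectral embedding of the edges of a small graph via its line graph.
# This is illustrative only; it is not the Spectral Line Navigator algorithm itself.
import numpy as np
import networkx as nx

# Hypothetical example graph: simple, connected, undirected, up to 10 nodes.
G = nx.erdos_renyi_graph(n=10, p=0.4, seed=0)
if not nx.is_connected(G):
    G = G.subgraph(max(nx.connected_components(G), key=len)).copy()

# Nodes of the line graph L(G) correspond to edges of G.
L_G = nx.line_graph(G)
A = nx.to_numpy_array(L_G)           # adjacency matrix of the line graph
D = np.diag(A.sum(axis=1))
laplacian = D - A

# Eigendecomposition of the line-graph Laplacian: each edge of G gets a
# spectral coordinate, analogous to the edge embedding discussed in the paper.
eigvals, eigvecs = np.linalg.eigh(laplacian)
edge_embedding = eigvecs[:, 1:4]     # a few low-frequency components
for edge, coords in zip(L_G.nodes(), edge_embedding.round(3).tolist()):
    print(edge, coords)
```

The sketch only shows how an edge-level spectral embedding can be obtained; how such an embedding is used to select the next edge along a shortest path is specified by the paper's algorithm, not by this snippet.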
Submission Number: 17