Keywords: Impossibility Result, Transformers, Attention mechanism, Graphs, Theory, Induction heads, In-context learning
TL;DR: We prove a dimension-independent impossibility result for single-head transformers and study the representational limits of attention via a graph-based task.
Abstract: In this paper, we establish a dimension- and precision-independent impossibility result for a simplified transformer model. Because of their size, a comprehensive understanding of the internal operations of frontier large language models (LLMs) is beyond the reach of current methods, but research on small, interpretable models has proven fruitful. We study the representational limits of attention, the core component of transformer models, through the lens of the Endpoint Selection Problem (ESP), a simple yet expressive learning task defined over the arcs of a directed graph. ESP is closely related to the 2-hop induction head problem studied in prior work, which itself can be formalized as a function composition task.
Our main theoretical results are twofold: (i) no 1-head, 1-layer, attention-only transformer can solve ESP on any graph containing a cycle, even with unbounded dimension and precision; (ii) in contrast, a 2-head, 1-layer, attention-only transformer can solve ESP on arbitrary directed graphs with constant embedding dimension and logarithmic precision. Prior lower bounds held only under assumptions bounding the dimension and precision. We complement the 1-head result by showing that, while a zero-error single-head model exists for every directed acyclic graph, it is NP-complete even to approximate the best single-head model, i.e., the one minimizing error over the arcs of an arbitrary directed graph.
Finally, we validate our theory with experiments and observe that gradient-based optimization reliably finds 1-head solutions for DAGs and 2-head solutions for arbitrary graphs with cycles, whereas 1-head models struggle to reach the optimal solution on such cyclic graphs. We believe that our techniques are of independent interest and have the potential to establish a new fine-grained hierarchy of transformer architectures, each with greater problem-solving power than the last.
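To make the model class concrete, the following is a minimal, self-contained sketch (our own illustration, not code from the submission) of a 1-layer, attention-only transformer with a configurable number of heads. The vertex/arc embedding scheme and the toy cyclic-graph input are assumptions made purely for illustration and are not the paper's ESP construction or proof technique.

```python
# Illustrative sketch only: a 1-layer, attention-only transformer layer
# (multi-head self-attention with a residual connection, no MLP, no layer norm),
# i.e., the model class whose representational limits the abstract discusses.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class AttentionOnlyLayer:
    """One transformer layer consisting solely of multi-head self-attention."""

    def __init__(self, d_model, n_heads, rng):
        assert d_model % n_heads == 0
        d_head = d_model // n_heads
        # One (W_Q, W_K, W_V, W_O) tuple per head, randomly initialized.
        self.heads = [
            tuple(rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                  for _ in range(3))
            + (rng.standard_normal((d_head, d_model)) / np.sqrt(d_head),)
            for _ in range(n_heads)
        ]

    def __call__(self, X):
        # X: (seq_len, d_model) token embeddings.
        out = X.copy()
        for W_Q, W_K, W_V, W_O in self.heads:
            Q, K, V = X @ W_Q, X @ W_K, X @ W_V
            scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq) attention logits
            attn = softmax(scores, axis=-1)           # each row sums to 1
            out = out + attn @ V @ W_O                # residual update per head
        return out

# Toy usage: encode the arcs of a small directed graph as a token sequence
# (one hypothetical "arc token" per edge) and run 1-head and 2-head layers.
rng = np.random.default_rng(0)
arcs = [(0, 1), (1, 2), (2, 0)]                # a directed 3-cycle
d_model, n_vertices = 8, 3
vertex_emb = rng.standard_normal((n_vertices, d_model // 2))
# Each arc token concatenates the embeddings of its tail and head vertices.
X = np.stack([np.concatenate([vertex_emb[u], vertex_emb[v]]) for u, v in arcs])

one_head = AttentionOnlyLayer(d_model, n_heads=1, rng=rng)
two_head = AttentionOnlyLayer(d_model, n_heads=2, rng=rng)
print(one_head(X).shape, two_head(X).shape)    # (3, 8) (3, 8)
```

The sketch only fixes the architecture class; the submission's results concern which weight settings of such 1-head versus 2-head layers can or cannot solve ESP.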
Supplementary Material: zip
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 20848