Keywords: Impossibility Result, Transformers, Attention mechanism, Graphs, Theory, Induction heads, In-context learning
TL;DR: We prove a dimension-independent impossibility result for single-head transformers and study the representational limits of attention via a graph-based task.
Abstract: In this paper, we establish a dimension- and precision-independent impossibility result for a simplified transformer model. Because of their size, a comprehensive understanding of the internal operations of frontier large language models (LLMs) is beyond the reach of current methods, but research on small, interpretable models has proven fruitful. We study the representational limits of attention, the core component of transformer models, through the lens of the Endpoint Selection Problem (ESP), a simple yet expressive learning task defined over the arcs of a directed graph. ESP is closely related to the 2-hop induction head problem studied in prior work, which itself can be formalized as a function composition task.
Our main theoretical results are twofold: (i) no 1-head, 1-layer, attention-only transformer can solve ESP on any graph containing a cycle, even with unbounded dimension and precision; (ii) in contrast, a 2-head, 1-layer, attention-only transformer can solve ESP on arbitrary directed graphs with constant embedding dimension and logarithmic precision. Prior lower bounds held only under assumptions bounding the dimension and precision. We complement the 1-head result by showing that, while a zero-error single-head model exists for every directed acyclic graph, it is NP-complete even to approximate the best single-head model, i.e., the one minimizing error over the arcs of an arbitrary directed graph.
Finally, we validate our theory with experiments and observe that gradient-based optimization reliably finds 1-head solutions for DAGs and 2-head solutions for arbitrary graphs with cycles, whereas 1-head models struggle to reach the optimal solution on such cyclic graphs. We believe that our techniques are of independent interest and have the potential to establish a new fine-grained hierarchy of transformer architectures, each with greater problem-solving power than the last.
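To make the model class concrete, the following is a minimal, self-contained sketch (our own illustration, not code from the submission) of a 1-layer, attention-only transformer with a configurable number of heads. The vertex/arc embedding scheme and the toy cyclic-graph input are assumptions made purely for illustration and are not the paper's ESP construction or proof technique.

```python
# Illustrative sketch only: a 1-layer, attention-only transformer layer
# (multi-head self-attention with a residual connection, no MLP, no layer norm),
# i.e., the model class whose representational limits the abstract discusses.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class AttentionOnlyLayer:
    """One transformer layer consisting solely of multi-head self-attention."""

    def __init__(self, d_model, n_heads, rng):
        assert d_model % n_heads == 0
        d_head = d_model // n_heads
        # One (W_Q, W_K, W_V, W_O) tuple per head, randomly initialized.
        self.heads = [
            tuple(rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                  for _ in range(3))
            + (rng.standard_normal((d_head, d_model)) / np.sqrt(d_head),)
            for _ in range(n_heads)
        ]

    def __call__(self, X):
        # X: (seq_len, d_model) token embeddings.
        out = X.copy()
        for W_Q, W_K, W_V, W_O in self.heads:
            Q, K, V = X @ W_Q, X @ W_K, X @ W_V
            scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq) attention logits
            attn = softmax(scores, axis=-1)           # each row sums to 1
            out = out + attn @ V @ W_O                # residual update per head
        return out

# Toy usage: encode the arcs of a small directed graph as a token sequence
# (one hypothetical "arc token" per edge) and run 1-head and 2-head layers.
rng = np.random.default_rng(0)
arcs = [(0, 1), (1, 2), (2, 0)]                # a directed 3-cycle
d_model, n_vertices = 8, 3
vertex_emb = rng.standard_normal((n_vertices, d_model // 2))
# Each arc token concatenates the embeddings of its tail and head vertices.
X = np.stack([np.concatenate([vertex_emb[u], vertex_emb[v]]) for u, v in arcs])

one_head = AttentionOnlyLayer(d_model, n_heads=1, rng=rng)
two_head = AttentionOnlyLayer(d_model, n_heads=2, rng=rng)
print(one_head(X).shape, two_head(X).shape)    # (3, 8) (3, 8)
```

The sketch only fixes the architecture class; the submission's results concern which weight settings of such 1-head versus 2-head layers can or cannot solve ESP.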
Supplementary Material: zip
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 20848