Is Random Attention Sufficient for Sequence Modeling?

ICLR 2026 Conference Submission 12829 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Transformer architecture, expressiveness, interpretability
TL;DR: The transformer architecture has a built-in inductive bias towards forming specialized circuits.
Abstract: The transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based learning on next-token prediction. While the core component of a transformer is the self-attention mechanism, we ask how much of the performance gain, and which aspects of it, can be attributed to attention. To this end, we compare standard transformers to variants in which either the attention weights or the MLP layers are frozen at initialization. Surprisingly, we find that attention with frozen key and query weights not only forms induction heads but also performs competitively on language modeling. We formalize this by proving a new expressivity result for transformer models with frozen attention weights. To further isolate the contribution of attention, we design MixiT -- the Mixing Transformer -- an architecture variant with entirely random attention scores and provably stable signal propagation, which overcomes prior depth-wise scaling challenges in random transformers. We use the successes and failures of our spectrum of models to pinpoint the role each main transformer component plays. Our results suggest that the transformer architecture has a built-in inductive bias towards in-context reasoning, as it can form specialized circuits even without learnable attention weights.
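To make the frozen-attention idea concrete, the sketch below shows a single-head causal self-attention layer whose query and key projections are fixed at their random initialization, so only the value and output projections receive gradient updates. This is a minimal illustrative reconstruction under stated assumptions (single head, PyTorch, the module and parameter names are hypothetical), not the submission's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenQKAttention(nn.Module):
    """Single-head causal self-attention with query/key projections frozen
    at their random initialization (only values/outputs are trained).

    Illustrative sketch only; not the submission's actual architecture.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        # Freeze the parameters that determine the attention scores.
        for p in (*self.w_q.parameters(), *self.w_k.parameters()):
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        # Causal mask: each position attends only to itself and earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1
        )
        scores = scores.masked_fill(mask, float("-inf"))
        return self.w_o(F.softmax(scores, dim=-1) @ v)
```

In this sketch the optimizer simply skips w_q and w_k because their parameters have gradients disabled; a MixiT-style variant, as described in the abstract, would go further and replace the input-dependent softmax scores with entirely random, input-independent mixing weights.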
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 12829