Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0
Abstract: Token mixing layers play a key role in how language models learn and generate long-range dependencies. Their efficiency hinges on a trade-off among decoding speed, memory requirements, and cache size. Focusing on causal generation, this paper explores new trade-offs through a unified framework that separates two crucial features: (i) the direct influence of inputs on outputs in one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, and also generalizes the recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity, and we provide theoretical insights into their expressivity, thereby trading runtime for expressivity in a principled way. We validate the approach empirically on synthetic tasks and on language modeling. Together, these results provide a unified toolkit for understanding and designing efficient and expressive token mixers across model families.
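The two features separated in the abstract can be illustrated with a small sketch. It is one possible reading of the abstract, not the paper's actual formulation: tokens are scalars, the weights are fixed rather than input-dependent, and the names `mix_tokens`, `attention_like`, `ssm_like`, and `two_step` are hypothetical. The matrix `direct` carries feature (i), the direct influence of inputs on the current output, and `recur` carries feature (ii), the propagation through past outputs.

```python
def mix_tokens(x, direct, recur):
    """Causal mixing: y[t] = sum_{s<=t} direct[t][s]*x[s] + sum_{s<t} recur[t][s]*y[s].

    Illustrative assumption only: scalar tokens and fixed weight matrices.
    """
    y = []
    for t in range(len(x)):
        from_inputs = sum(direct[t][s] * x[s] for s in range(t + 1))  # feature (i)
        from_past = sum(recur[t][s] * y[s] for s in range(t))         # feature (ii)
        y.append(from_inputs + from_past)
    return y


def zeros(n):
    return [[0.0] * n for _ in range(n)]


def attention_like(n):
    """Dense direct term, no recurrence: each output is a causal mean of inputs."""
    direct = zeros(n)
    for t in range(n):
        for s in range(t + 1):
            direct[t][s] = 1.0 / (t + 1)
    return direct, zeros(n)


def ssm_like(n, a):
    """Diagonal direct term, recurrence on the immediate predecessor only."""
    direct, recur = zeros(n), zeros(n)
    for t in range(n):
        direct[t][t] = 1.0
        if t >= 1:
            recur[t][t - 1] = a
    return direct, recur


def two_step(n, a, b):
    """Like ssm_like, but each state also depends on the state two steps back."""
    direct, recur = ssm_like(n, a)
    for t in range(2, n):
        recur[t][t - 2] = b
    return direct, recur


x = [1.0, 2.0, 3.0, 4.0]
y_attention = mix_tokens(x, *attention_like(len(x)))
y_ssm = mix_tokens(x, *ssm_like(len(x), 0.5))
y_two_step = mix_tokens(x, *two_step(len(x), 0.5, 0.25))
```

In this toy setting the attention-like pattern reads every past input at each step, while the two recurrent patterns touch only one or two past outputs per step. The sparsity pattern of `recur` is where a structured design would trade per-step cost against how far information can travel.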
Lay Summary: Modern AI systems process text by combining information from many previous words. Current methods face an important trade-off: some are powerful but computationally expensive, while others are faster but struggle to remember long-range information. This limitation becomes especially important as language models are asked to handle increasingly long documents and conversations. In this work, we introduce a general framework that helps explain and compare different ways AI models mix information across text. Our framework shows that many existing architectures can be understood through the same underlying principles, while also enabling new designs that balance efficiency and expressiveness in different ways. Using this framework, we develop several new token-mixing strategies that reduce computational costs while preserving the ability to capture long-range dependencies. We also provide theoretical tools to analyze how well these methods can transmit and store information over time. Experiments on synthetic memory tasks and language modeling demonstrate that carefully structured models can achieve much of the performance of expensive attention mechanisms at a fraction of the computational cost. These results offer new directions for designing faster and more scalable language models.
Primary Area: Deep Learning
Keywords: token mixing, linear recurrence
Submission Number: 34554