Abstract: Token mixing layers play a key role in how language models learn
and generate long-range dependencies. Their efficiency hinges on an
unavoidable trade-off among decoding speed, memory requirements, and
cache size. Focusing on causal generation, this paper explores new
trade-offs through a unified framework that separates two crucial
features: (i) the direct influence of inputs on outputs within one
generation step; (ii) the recurrent propagation of information through
past outputs.
This framework encompasses major architectures such as attention and
state-space models, but also generalizes the recurrence equations by
allowing each state to depend on multiple past states rather than
only the immediate predecessor. By introducing structure, we design
new recurrence patterns that provably achieve the desired
complexity, and we provide theoretical insights into their
expressivity, trading runtime for expressivity in a principled
way. We validate the approach empirically on synthetic tasks and on
language modeling. Together, these results provide a
unified toolkit for understanding and designing efficient and
expressive token mixers across model families.
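As an illustration of the generalized recurrence described in the abstract, the sketch below lets each state depend on a chosen set of past states rather than only its immediate predecessor. It is a toy example, not the paper's method: the function names (`mix_tokens`, `chain`, `power_of_two_skips`), the scalar weight `a`, and the specific skip pattern are all assumptions made here for illustration.

```python
import numpy as np

def mix_tokens(x, predecessors, a=0.5):
    """Toy causal mixer (hypothetical): h[t] = x[t] + a * sum of h[s] for s in predecessors(t).

    x            : sequence of inputs, shape (T,) or (T, d)
    predecessors : function mapping a position t to a list of earlier positions
    a            : scalar recurrence weight (an assumption, not from the paper)
    """
    x = np.asarray(x, dtype=float)
    h = np.zeros_like(x)
    for t in range(len(x)):
        h[t] = x[t]                      # direct influence of the current input
        for s in predecessors(t):        # recurrent propagation through past states
            h[t] = h[t] + a * h[s]
    return h

def chain(t):
    # Standard recurrence: each state depends only on its immediate predecessor.
    return [t - 1] if t > 0 else []

def power_of_two_skips(t):
    # One possible structured pattern: depend on states at offsets 1, 2, 4, ...
    offsets, k = [], 1
    while t - k >= 0:
        offsets.append(t - k)
        k *= 2
    return offsets

impulse = [1.0, 0.0, 0.0, 0.0]
h_chain = mix_tokens(impulse, chain)
h_skips = mix_tokens(impulse, power_of_two_skips)
```

With `chain`, the impulse at position 0 decays geometrically, as in a standard linear recurrence. With `power_of_two_skips`, each state reads from a logarithmic number of earlier states, so information from position 0 reaches later positions along several shorter paths at a higher per-step cost.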
Lay Summary: Modern AI systems process text by combining information from many previous words. Current methods face an important trade-off: some are powerful but computationally expensive, while others are faster but struggle to remember long-range information. This limitation becomes especially important as language models are asked to handle increasingly long documents and conversations.
In this work, we introduce a general framework that helps explain and compare different ways AI models mix information across text. Our framework shows that many existing architectures can be understood through the same underlying principles, while also enabling new designs that balance efficiency and expressiveness in different ways.
Using this framework, we develop several new token-mixing strategies that reduce computational costs while preserving the ability to capture long-range dependencies. We also provide theoretical tools to analyze how well these methods can transmit and store information over time.
Experiments on synthetic memory tasks and language modeling demonstrate that carefully structured models can achieve much of the performance of expensive attention mechanisms at a fraction of the computational cost. These results offer new directions for designing faster and more scalable language models.
Primary Area: Deep Learning
Keywords: token mixing, linear recurrence
Submission Number: 34554