Keywords: Transformer, RNN, Bidirectionality, SSM, Linear Attention, Inference, Training
TL;DR: We propose LION, a framework for extending Linear Transformers to the bidirectional setting by providing three theoretically equivalent representations: full attention, bidirectional RNN, and chunkwise parallel form.
Abstract: Linear Transformers and State Space Models have emerged as efficient alternatives to softmax Transformers for causal sequence modeling, enabling parallel training via matrix multiplication and efficient RNN-style inference. However, despite their success in causal tasks, no unified framework exists for applying Linear Transformers to bidirectional sequence modeling. We introduce LION, the first framework to systematically extend Linear Transformers to the bidirectional setting. LION generalizes three core representations commonly used in the causal case (full Linear Attention, bidirectional RNN, and chunkwise parallel form) to the bidirectional setting. These forms are theoretically equivalent and enable models to exploit the strengths of each during training and inference. We prove that a broad class of Linear Transformers can be extended using LION and validate our framework via three core examples based on the choice of decay type: LION-LIT, the bidirectional extension of [25]; LION-D, based on [44]; and LION-S, a variant using selective decay [34, 13]. Across standard bidirectional tasks, LION enables models to match or exceed the performance of softmax Transformers, while offering significantly faster training and more efficient inference than existing State Space Models.
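The following is an illustrative sketch, not the paper's code: it shows the claimed equivalence in its simplest setting, an unnormalized, decay-free bidirectional linear attention in the spirit of LION-LIT, computed once in the full-attention form and once as a forward-plus-backward RNN-style pass. The actual framework additionally handles feature maps, normalization, fixed and selective decay, and the chunkwise parallel form.

```python
# Minimal sketch (assumed simplification): unnormalized bidirectional linear
# attention with no decay, comparing the full-attention form against a
# two-pass RNN form. All names here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4                          # sequence length, head dimension
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# 1) Full (parallel) attention form: Y = (Q K^T) V over the whole sequence.
Y_full = (Q @ K.T) @ V

# 2) Bidirectional RNN form: a forward state accumulates sum_{j<=t} k_j v_j^T,
#    a backward state accumulates sum_{j>=t} k_j v_j^T; the diagonal term
#    k_t v_t^T appears in both states, so it is subtracted once.
S_fwd = np.zeros((d, d))
fwd_states = []
for t in range(L):
    S_fwd = S_fwd + np.outer(K[t], V[t])
    fwd_states.append(S_fwd.copy())

S_bwd = np.zeros((d, d))
bwd_states = [None] * L
for t in reversed(range(L)):
    S_bwd = S_bwd + np.outer(K[t], V[t])
    bwd_states[t] = S_bwd.copy()

Y_rnn = np.stack([
    Q[t] @ (fwd_states[t] + bwd_states[t] - np.outer(K[t], V[t]))
    for t in range(L)
])

assert np.allclose(Y_full, Y_rnn)    # the two representations agree
```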
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 23483