Keywords: All-MLP, Sequence Modeling, Multilayer Perceptron, Transformer
TL;DR: We present Causal Relation Networks (CausalRNs), the first all-MLP sequence modeling architecture with linear-time parallel training.
Abstract: We present Causal Relation Networks (CausalRNs), the first all-MLP sequence modeling architecture with linear-time parallel training.
To enable autoregressive modeling, we make Relation Networks (RNs) equivariant and causal through relaxation and masking.
Contrary to the earlier belief that RNs are quadratic-time, we show that with exp(x) as the activation function, any RN is linear-time, fully parallelizable, and numerically stable.
Our derivation naturally gives rise to familiar design choices adopted by state-of-the-art architectures, e.g., exponential gating and state expansion.
This duality provides a new perspective from which we not only validate popular design choices but also identify new design considerations.
Experiments on autoregressive language modeling and image classification show CausalRNs to be comparable to Linear Transformers.
The quadratic variant of CausalRNs achieves perfect retrieval on the copying task, which was previously only possible with Transformers.
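To make the linear-time claim concrete, here is a minimal sketch, not the paper's implementation: the projection matrices W_q, W_k, W_v and the specific pairwise form exp(q_i + k_j) * v_j are assumptions for illustration. It shows how the identity exp(a + b) = exp(a) * exp(b) lets a causal pairwise sum collapse into a single cumulative sum, so the sequence is processed in time linear in its length.

```python
import torch

def causal_exp_rn_sketch(x, W_q, W_k, W_v):
    """Hypothetical illustration (not the paper's code): aggregate
        out_i = sum_{j <= i} exp(q_i + k_j) * v_j
    Because exp(q_i + k_j) = exp(q_i) * exp(k_j), the inner sum over j
    becomes a prefix sum that is shared across positions, so the whole
    sequence costs O(T) instead of O(T^2). The paper's exact formulation,
    gating, and numerical-stability measures are not reproduced here."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v              # each (T, d)
    prefix = torch.cumsum(torch.exp(k) * v, dim=0)   # (T, d): sum_{j<=i} exp(k_j) * v_j
    return torch.exp(q) * prefix                     # (T, d): exp(q_i) * prefix_i

# Tiny usage example
T, d = 8, 16
x = torch.randn(T, d)
W_q, W_k, W_v = (0.1 * torch.randn(d, d) for _ in range(3))
out = causal_exp_rn_sketch(x, W_q, W_k, W_v)
print(out.shape)  # torch.Size([8, 16])
```

The same factorization trick is what makes linear-attention-style models parallelizable, which is consistent with the duality the abstract alludes to.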
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13780