Keywords: linear recurrent neural networks, RNNs, sequence models, FST
TL;DR: We introduce Feature-Sequence Twisting (FST), a method that transposes sequence and feature dimensions between linear recurrent neural network blocks, enabling deeper sequence representations.
Abstract: The transformer network architecture has driven major advances in artificial intelligence.
Conversational AI applications, such as ChatGPT, and protein folding predictions with AlphaFold are made possible by transformer architectures and the self-attention mechanism.
However, advancing towards more general, flexible, and energy-efficient artificial intelligence may require exploring new architectures that differ significantly from those currently used.
Transformer networks have largely replaced recurrent neural networks (RNNs) for state-of-the-art performance on sequence-based tasks.
In recent years, however, linear recurrent neural networks (LRNNs) and state space models (SSMs) have emerged as competitive alternatives.
A core advantage of LRNNs and SSMs over traditional RNNs is that the hidden states can be calculated in parallel.
Therefore, like the transformer, they can make efficient use of GPU computation.
Unlike the transformer, whose computational cost scales quadratically with sequence length, parallelized LRNNs and SSMs can scale sub-quadratically.
Despite these advantages, LRNNs and SSMs often struggle to generate the deep and rich representations that have contributed to the success of transformer architectures.
We introduce Feature-Sequence Twisting (FST), a novel technique that transposes the sequence and feature dimensions between LRNN blocks.
The purpose of FST is to generate deeper representations of the sequence in subsequent LRNN blocks.
Since the computational cost of LRNNs scales sub-quadratically with sequence length, FST remains practical to compute even for large feature dimensions.
Our experiments demonstrate that the FST architecture outperforms transformer networks on tasks such as Long ListOps, achieving performance competitive with state-of-the-art models.
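A minimal sketch of the idea described above, for illustration only (the block internals, widths, and names are assumptions of this sketch, not the authors' implementation): alternate linear-recurrence blocks with a transpose of the sequence and feature axes between them, so each subsequent block recurs along what was previously the feature dimension.

```python
import torch
import torch.nn as nn

class LinearRecurrentBlock(nn.Module):
    """Toy stand-in for an LRNN/SSM block: a diagonal linear recurrence
    h_t = a * h_{t-1} + u_t followed by a pointwise projection.
    (Hypothetical; real LRNN blocks compute the recurrence with a parallel scan.)"""
    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))   # per-feature decay parameter
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                             # x: (batch, seq, dim)
        a = torch.sigmoid(self.log_a)                 # keep |a| < 1 for stability
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.shape[1]):                   # sequential loop for clarity only
            h = a * h + u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))

class FSTModel(nn.Module):
    """Feature-Sequence Twisting: transpose the (sequence, feature) axes between
    blocks, so the next block recurs over the previous feature dimension."""
    def __init__(self, seq_len, dim, n_blocks=4):
        super().__init__()
        # After each transpose the feature size equals the previous sequence length,
        # so block widths alternate between dim and seq_len (an assumption of this sketch).
        widths = [dim if i % 2 == 0 else seq_len for i in range(n_blocks)]
        self.blocks = nn.ModuleList([LinearRecurrentBlock(w) for w in widths])

    def forward(self, x):                             # x: (batch, seq_len, dim)
        for block in self.blocks:
            x = block(x)
            x = x.transpose(1, 2)                     # twist: swap sequence and feature axes
        return x

# Usage: with an even number of blocks the output shape matches the input shape.
model = FSTModel(seq_len=128, dim=64)
y = model(torch.randn(2, 128, 64))                    # -> (2, 128, 64)
```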
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11286