Sub-Linear Memory: How to Make Performers SLiM

Valerii Likhosherstov; Krzysztof Marcin Choromanski; Jared Quincy Davis; Xingyou Song; Adrian Weller

Published: 09 Nov 2021, Last Modified: 04 May 2025. Venue: NeurIPS 2021 Poster. Readers: Everyone

Keywords: transformer, performer, slim-performer, memory efficient, linear transformer

TL;DR: We show that Performer architectures only require $O(1)$ memory for training as a function of sequence length $L$.

Abstract: Transformer architectures have become very popular, yet the original implementation requires $O(L^2)$ serial time and memory as functions of the input length $L$. Recent works have proposed various linear self-attention mechanisms that scale only as $O(L)$ for serial computation. We conduct a thorough complexity analysis of Performers, a class which includes most recent linear Transformer mechanisms. We note a remarkable computational flexibility: the gradient computation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at the cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory and still requires $O(L)$ time. Due to complete backward compatibility, this time-memory tradeoff can be used for fine-tuning on low-memory devices in a decentralized fashion without any server computations.
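The memory claim in the abstract can be illustrated with a minimal sketch of causal linear attention. It is not the authors' implementation. It assumes the queries and keys have already been passed through a positive feature map, and all function names (`causal_linear_attention`, `rollback`, `quadratic_reference`) are hypothetical. The running state `(S, z)` has a size that does not depend on $L$, and because each update is a plain sum it can be undone by subtraction, which is one way intermediate states could be recovered during a backward pass instead of being stored.

```python
def causal_linear_attention(q_feats, k_feats, values):
    """Causal linear attention using a running state of size independent of L.

    q_feats, k_feats: lists of L feature vectors of length M (assumed positive,
    i.e. the feature map has already been applied; this is an assumption).
    values: list of L value vectors of length d.
    Returns the per-position outputs and the final state (S, z).
    """
    M, d = len(k_feats[0]), len(values[0])
    S = [[0.0] * d for _ in range(M)]  # running sum of outer(k_i, v_i)
    z = [0.0] * M                      # running sum of k_i (normaliser)
    outputs = []
    for q, k, v in zip(q_feats, k_feats, values):
        for m in range(M):
            z[m] += k[m]
            for j in range(d):
                S[m][j] += k[m] * v[j]
        denom = sum(q[m] * z[m] for m in range(M))
        outputs.append(
            [sum(q[m] * S[m][j] for m in range(M)) / denom for j in range(d)]
        )
    return outputs, (S, z)


def rollback(state, k, v):
    """Undo one state update in place, recovering the previous state."""
    S, z = state
    for m in range(len(z)):
        z[m] -= k[m]
        for j in range(len(v)):
            S[m][j] -= k[m] * v[j]


def quadratic_reference(q_feats, k_feats, values):
    """O(L^2) reference that forms the causal attention weights explicitly."""
    d = len(values[0])
    outputs = []
    for i, q in enumerate(q_feats):
        w = [sum(a * b for a, b in zip(q, k_feats[j])) for j in range(i + 1)]
        total = sum(w)
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(i + 1)) / total for c in range(d)]
        )
    return outputs
```

The sketch keeps the full list of outputs only for comparison with the quadratic reference. In floating point, `rollback` recovers earlier states only up to rounding error.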


Supplementary Material: pdf

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/sub-linear-memory-how-to-make-performers-slim/code)
