StagFormer: A Staggered Transformer for Decoding Layers in Parallel

Dylan J Cutler; Arun Kandoor; Nishanth Dikkala; Xin Wang; Nikunj Saunshi; Rina Panigrahy

25 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: decoder only language models, transformers, staggered execution, pipelining, parallel decoding, efficiency

Abstract: Standard decoding in a Transformer-based language model is inherently sequential: a token's embedding must pass through all the layers in the network before generation of the next token can start. In this work, we propose a new architecture, StagFormer (Staggered Transformer), which staggers execution along the time axis and thereby enables parallelizing the decoding process along the depth of the model. We achieve this by breaking the dependency of the token representation at time step $i$ in layer $l$ upon the representations of tokens until time step $i$ from layer $l-1$. Instead, we stagger the execution and only allow a dependency on token representations until time step $i-1$. The later sections of the Transformer still get access to the "rich" representations from the prior section, but only from those token positions which are one time step behind. StagFormer allows different sections of the model to be executed in parallel, yielding up to 33% speedup in decoding while being quality neutral. We also explore several natural variants of this idea. We show how weight-sharing across the staggered sections can be more practical in settings with limited memory, and how such weight-sharing can be used to approximate a recurrent model during inference. We explore the efficacy of using bounded window attention to pass information from one section to another, which helps drive further latency gains for some applications. We also demonstrate the scalability of the staggering idea to more than 2 sections of the Transformer.
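
The dependency change described in the abstract can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function `decode_finish_time`, the unit cost per section, and the assumption of unlimited parallel hardware are all hypothetical simplifications made for illustration. The sketch computes the earliest time at which the last section finishes the last token, first with the standard dependency (section $s$ at step $i$ waits for section $s-1$ at step $i$) and then with the staggered one (it waits only for section $s-1$ at step $i-1$).

```python
def decode_finish_time(num_tokens, num_sections, staggered):
    """Earliest finish time of the last section on the last token.

    Illustrative assumptions (not taken from the paper):
    - each (section, time step) node costs exactly 1 time unit;
    - there is enough hardware to run every ready node in parallel;
    - the token fed in at step i is produced by the last section at step i-1.
    """
    last = num_sections - 1
    # finish[s][i] = time at which section s finishes processing step i
    finish = [[0] * num_tokens for _ in range(num_sections)]
    for i in range(num_tokens):
        for s in range(num_sections):
            deps = []
            if i > 0:
                # Autoregressive dependency: token i must have been generated.
                deps.append(finish[last][i - 1])
            if s > 0:
                if staggered:
                    # Staggered: only needs the previous section up to step i-1.
                    if i > 0:
                        deps.append(finish[s - 1][i - 1])
                else:
                    # Standard: needs the previous section at the same step i.
                    deps.append(finish[s - 1][i])
            finish[s][i] = 1 + max(deps, default=0)
    return finish[last][num_tokens - 1]


standard = decode_finish_time(num_tokens=8, num_sections=2, staggered=False)
stag = decode_finish_time(num_tokens=8, num_sections=2, staggered=True)
print(standard, stag)  # 16 8
```

Under these idealized assumptions the staggered schedule takes one section-time per token instead of one per section per token. The measured gain reported in the abstract (up to 33%) is smaller than this idealized bound, which is expected since the sketch ignores any real-world overheads such as passing representations between sections.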

Primary Area: foundation or frontier models, including LLMs

Submission Number: 4637
