MQSP: Micro-Query Sequence Parallelism for Linearly Scaling Long Sequence Transformer

Published: 01 Feb 2023, Last Modified: 13 Feb 2023
Submitted to ICLR 2023
Readers: Everyone
Keywords: Sequence parallelism, Long Sequence Transformer, Distributed training
Abstract: Long-sequence modeling with Transformers is becoming prevalent in fields involving long texts and high-resolution images and videos, but it suffers from quadratic memory complexity. Existing work handles this with low-complexity variants or parallel methods. The former approximate full attention and are limited by a single device's capacity; the latter struggle to manage the quadratic memory of attention maps, leading to insufficient sequence scalability. In this work, we propose a novel parallel method named $\textbf{M}$icro-$\textbf{Q}$uery $\textbf{S}$equence $\textbf{P}$arallelism (MQSP). MQSP slices sequences across devices and projects local queries, keys, and values in self-attention. For communication and memory efficiency, MQSP all-gathers the queries while the keys and values remain local, yielding a local attention map on which a distributed softmax is performed to amortize memory by column. Meanwhile, the queries are further partitioned into Micro-Q to divide the computation and recycle the attention map by row, jointly decomposing the quadratic memory to achieve linear scalability. Evaluation shows that MQSP scales sequence length linearly, reaching 4.5$\times$ the sequence length of ColossalAI's sequence parallelism and 4.3$\times$ that of Megatron-LM3, and enables training BERT-large with a sequence length of 78848 on 32 A100 GPUs. MQSP reduces memory occupation by up to 78.6$\%$ and achieves up to 3.3$\times$ throughput when training with a sequence length of 17408. The convergence experiments show that MQSP handles long sequences with guaranteed convergence, bringing the potential for the Transformer to explore longer sequences.
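The attention pattern described in the abstract (all-gather queries, keep keys and values local, distributed softmax by column, Micro-Q chunking by row) can be sketched with standard collectives. The snippet below is a minimal illustrative sketch using torch.distributed, not the authors' implementation: the function name `mqsp_attention`, the `micro_q_size` argument, and the final all-reduce (where a reduce-scatter over rows would be the memory-efficient choice) are assumptions made for clarity.

```python
import math
import torch
import torch.distributed as dist

def mqsp_attention(q_local, k_local, v_local, micro_q_size=1024, group=None):
    """Sketch of MQSP attention for one head.

    q_local, k_local, v_local: [seq_local, d] shards held by this rank.
    """
    world = dist.get_world_size(group)
    seq_local, d = q_local.shape

    # All-gather the query shards; keys and values stay local.
    q_parts = [torch.empty_like(q_local) for _ in range(world)]
    dist.all_gather(q_parts, q_local, group=group)
    q_full = torch.cat(q_parts, dim=0)                       # [seq_full, d]

    out_chunks = []
    # Micro-Q: process the gathered queries in row chunks so the
    # [micro_q, seq_local] attention-map buffer is reused row by row.
    for q_chunk in q_full.split(micro_q_size, dim=0):
        scores = q_chunk @ k_local.t() / math.sqrt(d)        # local column block

        # Distributed softmax over the key dimension, which is split by
        # column across ranks: all-reduce the row-wise max and row-wise sum.
        row_max = scores.max(dim=-1, keepdim=True).values
        dist.all_reduce(row_max, op=dist.ReduceOp.MAX, group=group)
        exp_scores = (scores - row_max).exp()
        row_sum = exp_scores.sum(dim=-1, keepdim=True)
        dist.all_reduce(row_sum, op=dist.ReduceOp.SUM, group=group)

        # Partial output against the local values only.
        out_chunks.append((exp_scores / row_sum) @ v_local)  # [micro_q, d]

    # Sum the partial outputs across ranks and keep this rank's rows
    # (a reduce-scatter would avoid materializing the full output).
    out_full = torch.cat(out_chunks, dim=0)                  # [seq_full, d]
    dist.all_reduce(out_full, op=dist.ReduceOp.SUM, group=group)
    rank = dist.get_rank(group)
    return out_full[rank * seq_local:(rank + 1) * seq_local]
```

This sketch covers only the forward attention pattern stated in the abstract; the projection layers, backward pass, and communication scheduling of the actual method are not reproduced here.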
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Infrastructure (eg, datasets, competitions, implementations, libraries)
TL;DR: MQSP is a novel sequence parallelism method that linearly scales long-sequence Transformers by all-gathering Micro-Q.
Supplementary Material: zip
