LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

Dacheng Li; Rulin Shao; Anze Xie; Eric Xing; Joseph E. Gonzalez; Ion Stoica; Xuezhe Ma; Hao Zhang

18 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: infrastructure, software libraries, hardware, etc.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: distributed large language model training, long context, sequence parallelism, recomputation, overlapping communication

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: Increasing the context length of large language models (LLMs) unlocks fundamentally new capabilities, but also significantly increases the memory footprint of training. Previous model-parallel systems such as Megatron-LM partition and compute different attention heads in parallel. This results in large communication volumes, and such systems cannot scale beyond the number of attention heads, which hinders their adoption. In this paper, we introduce a new approach, LightSeq, for long-context LLM training. LightSeq has several notable advantages. First, LightSeq partitions over the sequence dimension, so it is agnostic to model architecture and readily applicable to models with varying numbers of attention heads, such as Multi-Head, Multi-Query, and Grouped-Query attention. Second, LightSeq not only requires up to 4.7× less communication than Megatron-LM on popular LLMs but also overlaps the communication with computation. To further reduce training time, LightSeq features a novel gradient checkpointing scheme that bypasses a forward computation for memory-efficient attention. We evaluate LightSeq on Llama-7B and its variants with sequence lengths from 32K to 512K. Through comprehensive experiments on single-node and cross-node training, we show that, compared to Megatron-LM, LightSeq achieves up to 1.24-2.01× end-to-end speedup and a 2-8× longer sequence length on models with fewer heads. Anonymized code is available at https://anonymous.4open.science/r/lightseq-anonymized.
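
The contrast the abstract draws between partitioning over attention heads and partitioning over the sequence dimension can be illustrated with a minimal sketch. This is not LightSeq's or Megatron-LM's code: the function names `max_head_parallel_degree` and `sequence_shards`, the 4-head model, and the 8-worker setup are hypothetical choices made for illustration. Only the 32K sequence length is taken from the abstract.

```python
def max_head_parallel_degree(num_heads: int, num_workers: int) -> int:
    """Workers that can do useful work when each needs at least one whole head.

    Hypothetical helper: models the cap that head-level partitioning implies.
    """
    return min(num_heads, num_workers)


def sequence_shards(seq_len: int, num_workers: int) -> list[tuple[int, int]]:
    """Split token positions [0, seq_len) into contiguous [start, end) ranges.

    Hypothetical helper: the split does not depend on the number of heads.
    Shard sizes differ by at most one token.
    """
    base, extra = divmod(seq_len, num_workers)
    shards = []
    start = 0
    for rank in range(num_workers):
        # The first `extra` workers take one additional token each.
        size = base + (1 if rank < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


if __name__ == "__main__":
    workers = 8           # assumed cluster size
    heads = 4             # assumed model with few attention heads
    seq_len = 32 * 1024   # 32K tokens, the shortest length in the evaluation

    print("workers usable with head partitioning:",
          max_head_parallel_degree(heads, workers))
    print("per-worker token ranges with sequence partitioning:")
    for rank, (start, end) in enumerate(sequence_shards(seq_len, workers)):
        print(f"  rank {rank}: [{start}, {end})")
```

In this toy setting, head partitioning leaves half of the workers idle, while the sequence split gives every worker an equal share of tokens regardless of the head count. The sketch covers only the partitioning step; how attention is computed across shards and how communication is overlapped with computation are outside its scope.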

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: pdf

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 1216
