LightSeq: : Sequence Level Parallelism for Distributed Training of Long Context Transformers

Published: 28 Oct 2023, Last Modified: 01 Dec 2023WANT@NeurIPS 2023 PosterEveryoneRevisionsBibTeX
Keywords: Distributed Large language models training, long context, sequence parallelism, recomputation, overlap communication
TL;DR: An scalable and efficient training sequence-parallel system for long-context transformer, optimized for causal language modeling objective.
Abstract: Increasing the context length of large language models (LLMs) unlocks fundamentally new capabilities, but also significantly increases the memory footprints of training. Previous model-parallel systems such as Megatron-LM partition and compute different attention heads in parallel, resulting in large communication volumes, so they cannot scale beyond the number of attention heads, thereby hindering its adoption. In this paper, we introduce a new approach, LightSeq, for long-context LLMs training. LightSeq has many notable advantages. First, LightSeq partitions over the sequence dimension, hence is agnostic to model architectures and readily applicable for models with varying numbers of attention heads, such as Multi-Head, Multi-Query and Grouped-Query attention. Second, LightSeq not only requires up to 4.7× less communication than Megatron-LM on popular LLMs but also overlaps the communication with computation. To further reduce the training time, LightSeq features a novel gradient checkpointing scheme to bypass an forward computation for memory-efficient attention. We evaluate LightSeq on Llama-7B and its variants with sequence lengths from 32K to 512K. Through comprehensive experiments on single and cross-node training, we show that LightSeq achieves up to 1.24-2.01× end-to-end speedup, and a 2-8× longer sequence length on models with fewer heads, compared to Megatron-LM. Codes are available at
Submission Number: 19