Abstract: Transformer-based models such as large language models (LLMs) have attracted significant attention in recent years due to their superior performance. Long input sequences are essential for industrial LLMs to provide better user services. However, memory consumption grows quadratically with sequence length, posing challenges for scaling up long-sequence training. Current parallelism methods produce duplicated tensors during execution, leaving room to improve memory efficiency. Additionally, tensor parallelism (TP) cannot effectively overlap computation and communication. To address these weaknesses, we propose a general parallelism method called memory-efficient tensor parallelism (METP), designed for the computation of two consecutive matrix multiplications with an optional function between them (O = f(AB)C), which is the kernel computation pattern in Transformer training. METP distributes subtasks of computing O across multiple devices and uses point-to-point send/recv instead of collective communication to exchange the submatrices needed to finish the computation, thereby avoiding duplicated tensors. We also apply the double buffering technique to better overlap computation and communication, and we derive the theoretical condition for full overlap to guide long-sequence Transformer training. For a parallel degree p, we prove through theoretical analysis that METP incurs O(1/p^3) memory overhead when attention is computed without FlashAttention, and that it saves at least 41.7% memory compared with TP when FlashAttention is used for multi-head self-attention. Our experimental results demonstrate that METP increases the sequence length by 2.38–2.99 times compared with other methods on eight A100 graphics processing units (GPUs).
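To make the kernel computation pattern concrete, the following is a minimal NumPy sketch of the block decomposition underlying O = f(AB)C, where each "device" owns one row block of A and accumulates its output block from the submatrices of B and C that METP would exchange via send/recv. This is only an illustration under simplifying assumptions: f is taken to be element-wise (for attention, the row-wise softmax would additionally require an online rescaling across blocks), the send/recv exchange and double buffering are simulated by direct indexing, and names such as `metp_style_blocked` and `p` are illustrative rather than taken from the paper's implementation.

```python
import numpy as np


def f(x):
    # Element-wise function between the two matmuls (e.g., an activation).
    # For attention, f would be a row-wise softmax, which also needs an
    # online rescaling across blocks; that correction is omitted here.
    return np.maximum(x, 0.0)


def metp_style_blocked(A, B, C, p):
    """Compute O = f(A @ B) @ C as p x p subtasks over p simulated devices.

    Device i owns the i-th row block of A. At each of the p steps it would
    receive the next column block of B and row block of C from a neighbor
    (simulated here by direct indexing) and accumulate its partial result,
    so the full f(AB) matrix is never materialized on any single device.
    """
    n = A.shape[0]
    blk = n // p
    O = np.zeros((n, C.shape[1]))
    for i in range(p):                      # subtask executed on "device" i
        Ai = A[i * blk:(i + 1) * blk]       # locally resident row block of A
        acc = np.zeros((blk, C.shape[1]))
        for j in range(p):                  # blocks obtained via send/recv
            Bj = B[:, j * blk:(j + 1) * blk]
            Cj = C[j * blk:(j + 1) * blk]
            acc += f(Ai @ Bj) @ Cj          # partial product; no full f(AB)
        O[i * blk:(i + 1) * blk] = acc
    return O


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, p = 8, 4, 4
    A = rng.standard_normal((n, d))
    B = rng.standard_normal((d, n))
    C = rng.standard_normal((n, d))
    # The blocked result matches the unpartitioned reference computation.
    assert np.allclose(metp_style_blocked(A, B, C, p), f(A @ B) @ C)
```

In an actual multi-device setting, the inner loop's indexing would be replaced by point-to-point send/recv of the B and C submatrices, with double buffering used to prefetch the next blocks while the current partial product is being computed.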