H3T: Efficient Integration of Memory Optimization and Parallelism for Large-scale Transformer Training

Yuzhong Wang; Xu Han; Weilin Zhao; Guoyang Zeng; Zhiyuan Liu; Maosong Sun

H3T: Efficient Integration of Memory Optimization and Parallelism for Large-scale Transformer Training

Yuzhong Wang, Xu Han, Weilin Zhao, Guoyang Zeng, Zhiyuan Liu, Maosong Sun

Published: 21 Sept 2023, Last Modified: 02 Nov 2023NeurIPS 2023 posterEveryoneRevisionsBibTeX

Keywords: ML System, Parallelism Learning, Memory Optimization, Data Parallelism, Model Parallelism, Parameter Parallelism, ZeRO, Rematerialization, Checkpointing, Tensor Offloading, Dynamic Programming

TL;DR: A framework for High-Throughput Transformer Training (H3T) that can train GPT-3-175B using only 64 NVIDIA A100 GPUs. The implementation can find efficient integration of memory optimization and parallelism for actual training of Transformers.

Abstract: In recent years, big models based on Transformers have achieved state-of-the-art performance on many artificial intelligence (AI) tasks. Despite the success of these Transformer-based models, their huge parameter size poses a serious challenge to their training, both from the storage and computation perspectives. To this end, memory optimization (e.g., rematerialization and offloading) and parallelism (e.g., data parallelism and model parallelism) are widely explored to make training Transformers more efficient. In this paper, we propose a framework to automatically find an efficient integration of memory optimization and parallelism for High-Throughput Transformer Training (named H3T), which is rarely considered by existing efforts for training big Transformer-based models. Specifically, we design search algorithms to combine appropriate memory optimization strategies and parallelism schemes to achieve a balance between memory overhead and training efficiency. We implement H3T based on an open-source toolkit BMTrain and then use H3T to train the Transformers of different sizes to evaluate the efficiency of H3T. The experimental results show that H3T outperforms the most popular deep learning (DL) toolkit Megatron-DeepSpeed by $1.2\times \sim 4.3\times$ training speed while reducing $34.6\% \sim 80.5\%$ of memory overhead. Moreover, H3T can use only 64 NVIDIA A100 GPUs to train GPT-3-175B, which is very difficult for existing DL toolkits. The source code is available at https://github.com/OpenBMB/BMTrain/tree/h3t.

Submission Number: 3012

Loading