Enhancing Stability for Large Models Training in Constrained Bandwidth Networks

Published: 21 Jun 2024, Last Modified: 26 Jul 2024, ES-FoMo-II 2024 Poster, CC BY 4.0
Keywords: ZeRO, Large Language Models, Efficient Training, Scalable Distributed Training
Abstract: Training extremely large language models with billions of parameters is a computationally intensive task that pushes the limits of current data-parallel training systems. While techniques like ZeRO++ (Wang et al., 2024) have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme employed to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm's ability to achieve reliable convergence on these massive models, where stock ZeRO++ hpZ fails to converge. The updated algorithm enables robust training of larger models with 98% throughput and model training speed improvement, without sacrificing the quality of convergence.
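The race condition described above can arise when the copy that builds the node-local secondary (hpZ) partition is allowed to overlap with the communication that materializes the full parameters. The sketch below is only a minimal illustration of the kind of stream-synchronization dependency involved, assuming a PyTorch setting with an initialized NCCL process group; the function build_secondary_partition, its arguments, and the buffer layout are hypothetical and do not represent the authors' DeepSpeed implementation.

# Hypothetical sketch: serialize the all-gather that materializes the full
# parameter with the copy that carves out the node-local secondary shard,
# so the copy never reads a buffer the all-gather is still writing.
import torch
import torch.distributed as dist

def build_secondary_partition(flat_param: torch.Tensor,
                              full_buffer: torch.Tensor,
                              local_rank: int,
                              ranks_per_node: int) -> torch.Tensor:
    """Gather the sharded parameter and slice out this rank's intra-node shard.

    flat_param:  this rank's primary (ZeRO-3) shard.
    full_buffer: preallocated buffer sized for the unsharded parameter.
    """
    comm_stream = torch.cuda.Stream()

    # Launch the cross-machine all-gather on a side stream so it can overlap
    # with unrelated work on the default stream.
    with torch.cuda.stream(comm_stream):
        dist.all_gather_into_tensor(full_buffer, flat_param)

    # Make the default stream wait for the all-gather before slicing.
    # Omitting this dependency is the kind of missing ordering that can let
    # the secondary-partition copy read a partially written buffer.
    torch.cuda.current_stream().wait_stream(comm_stream)

    shard_size = full_buffer.numel() // ranks_per_node
    start = local_rank * shard_size
    # .clone() gives the secondary partition its own storage, so later reuse
    # of full_buffer cannot alias it.
    return full_buffer.narrow(0, start, shard_size).clone()

The design point of the sketch is simply that correctness hinges on an explicit ordering edge between the gather and the partitioning copy; how that ordering is enforced in the actual modified hpZ algorithm is detailed in the paper, not here.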
Submission Number: 16