ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library

Published: 01 Jan 2021, Last Modified: 10 May 2023. IEEE Micro 2021.
Abstract: Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that exploit heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.
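The abstract does not expose ACCL's API, so the sketch below uses torch.distributed as a stand-in to illustrate the collective at the heart of the problem: the gradient all-reduce that dominates communication cost in data-parallel training and that a library like ACCL optimizes. The names allreduce_gradients and main are illustrative, not part of ACCL.

import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    # Sum each gradient tensor across all workers, then average.
    # This per-step collective is the communication cost that
    # bounds scalability as the worker count grows.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT;
    # "gloo" keeps the sketch runnable on CPU-only machines.
    dist.init_process_group(backend="gloo")
    model = torch.nn.Linear(8, 1)
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    loss.backward()
    allreduce_gradients(model)  # synchronize gradients before the optimizer step
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Run with, e.g., torchrun --nproc_per_node=2 accl_sketch.py (the filename is hypothetical). An optimized library replaces the naive per-tensor loop with fused, topology-aware collectives scheduled across all available interconnects.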