ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library

Published: 01 Jan 2021, Last Modified: 10 May 2023. IEEE Micro 2021.
Abstract: Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that exploit heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.
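The abstract does not expose ACCL's API, so the sketch below uses torch.distributed as a stand-in to illustrate the collective at the heart of the problem: the gradient all-reduce that dominates communication cost in data-parallel training and that a library like ACCL optimizes. The names allreduce_gradients and main are illustrative, not part of ACCL.

import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    # Sum each gradient tensor across all workers, then average.
    # This per-step collective is the communication cost that
    # bounds scalability as the worker count grows.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT;
    # "gloo" keeps the sketch runnable on CPU-only machines.
    dist.init_process_group(backend="gloo")
    model = torch.nn.Linear(8, 1)
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    loss.backward()
    allreduce_gradients(model)  # synchronize gradients before the optimizer step
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Run with, e.g., torchrun --nproc_per_node=2 accl_sketch.py (the filename is hypothetical). An optimized library replaces the naive per-tensor loop with fused, topology-aware collectives scheduled across all available interconnects.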