Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning

Published: 27 Apr 2024 · Last Modified: 12 Dec 2024 · ASPLOS 2024 · CC BY-SA 4.0
Abstract: Efficiently training large language models (LLMs) necessitates hybrid parallel methods that integrate multiple communication collectives within distributed partitioned graphs. Overcoming communication bottlenecks is crucial and is often achieved by overlapping communication with computation. However, existing overlap methodologies tend to lean toward either fine-grained kernel fusion or limited operation scheduling, constraining performance optimization in heterogeneous training environments. This paper introduces Centauri, an innovative framework that encompasses comprehensive communication partitioning and hierarchical scheduling schemes for optimized overlap. We propose a partition space comprising three inherent abstraction dimensions: primitive substitution, topology-aware group partitioning, and workload partitioning. These dimensions collectively create a comprehensive optimization space for efficient overlap. To determine an efficient overlap of communication and computation operators, we decompose the scheduling tasks in hybrid parallel training into three hierarchical tiers: operation, layer, and model. Through these techniques, Centauri effectively hides communication latency and enhances hardware utilization. Evaluation results demonstrate that Centauri achieves up to 1.49× speedup over prevalent methods across various parallel training configurations.
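To make the workload-partitioning idea concrete, below is a minimal sketch (not Centauri's actual implementation or API) of how a collective feeding a computation can be split into chunks so that one chunk's communication overlaps with another chunk's computation. It uses PyTorch's `torch.distributed` all-gather with `async_op=True`; the function name `overlapped_allgather_matmul`, the chunk count, and the tensor shapes are illustrative assumptions, and it assumes the process group is already initialized (e.g. with the NCCL backend) and that the local shard divides evenly into chunks.

```python
# Illustrative sketch of workload partitioning for communication-computation
# overlap (assumed example, not the paper's code): an all-gather feeding a
# matmul is split into chunks so that chunk i+1's communication is in flight
# while chunk i's matmul runs.
import torch
import torch.distributed as dist


def overlapped_allgather_matmul(local_x: torch.Tensor,
                                weight: torch.Tensor,
                                num_chunks: int = 4) -> torch.Tensor:
    world_size = dist.get_world_size()

    # Partition the local shard along the row dimension (assumes it divides
    # evenly across num_chunks for simplicity).
    x_chunks = local_x.chunk(num_chunks, dim=0)

    # Output buffers for each chunk's all-gather across all ranks.
    gathered = [[torch.empty_like(c) for _ in range(world_size)]
                for c in x_chunks]

    # Launch the first chunk's all-gather asynchronously.
    handles = [dist.all_gather(gathered[0], x_chunks[0].contiguous(),
                               async_op=True)]

    outputs = []
    for i in range(num_chunks):
        # Issue the next chunk's communication before computing on the
        # current one, so the NCCL transfer overlaps with the matmul below.
        if i + 1 < num_chunks:
            handles.append(
                dist.all_gather(gathered[i + 1], x_chunks[i + 1].contiguous(),
                                async_op=True))
        handles[i].wait()
        # Rows are independent under matmul, so computing chunk by chunk is
        # valid; note the output row order is chunk-major across ranks, a
        # simplification acceptable for this illustration.
        full_chunk = torch.cat(gathered[i], dim=0)
        outputs.append(full_chunk @ weight)

    return torch.cat(outputs, dim=0)
```

The same decomposition generalizes to other collectives (e.g. reduce-scatter before a gradient update): finer chunks expose more overlap but add kernel-launch and synchronization overhead, which is the trade-off the paper's hierarchical scheduling is designed to navigate.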