Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
Abstract: Efficiently training large language models (LLMs) necessitates hybrid parallel methods that integrate multiple communication collectives within distributed partitioned graphs. Overcoming communication bottlenecks is crucial and is often achieved by overlapping communication with computation. However, existing overlap methodologies tend toward either fine-grained kernel fusion or limited operation scheduling, constraining performance optimization in heterogeneous training environments.
This paper introduces Centauri, an innovative framework that encompasses comprehensive communication partitioning and hierarchical scheduling schemes for optimized overlap. We propose a partition space comprising three inherent abstraction dimensions: primitive substitution, topology-aware group partitioning, and workload partitioning. Together, these dimensions create a comprehensive optimization space for efficient overlap. To determine an efficient overlap of communication and computation operators, we decompose the scheduling tasks in hybrid parallel training into three hierarchical tiers: operation, layer, and model. Through these techniques, Centauri effectively hides communication latency and enhances hardware utilization. Evaluation results demonstrate that Centauri achieves up to 1.49× speedup over prevalent methods across various parallel training configurations.
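To make the workload-partitioning dimension concrete, the sketch below illustrates one common instance of communication-computation overlap: an all-gather of sequence-parallel activations is split into chunks so that each chunk's collective runs concurrently with the matmul on the previously gathered chunk. This is a minimal illustration under assumed settings (an initialized torch.distributed process group, activation rows evenly divisible by the chunk count), not Centauri's actual implementation; the function and parameter names are hypothetical.

```python
# Minimal sketch of workload partitioning for overlap (illustrative only,
# not Centauri's code). Assumes torch.distributed is already initialized.
import torch
import torch.distributed as dist


def overlapped_allgather_matmul(x_local, weight, num_chunks=4):
    """Compute all_gather(x_local) @ weight, overlapping the gather of each
    activation chunk with the matmul on the previously gathered chunk."""
    world_size = dist.get_world_size()
    x_chunks = [c.contiguous() for c in x_local.chunk(num_chunks, dim=0)]

    # Pre-allocate gather buffers and launch the first (asynchronous) gather.
    gathered = [torch.empty(world_size * c.shape[0], c.shape[1],
                            dtype=c.dtype, device=c.device)
                for c in x_chunks]
    handles = [dist.all_gather_into_tensor(gathered[0], x_chunks[0],
                                           async_op=True)]

    partials = []
    for i in range(num_chunks):
        # Issue the next gather before waiting on the current one, so the
        # collective runs concurrently with this iteration's matmul.
        if i + 1 < num_chunks:
            handles.append(dist.all_gather_into_tensor(gathered[i + 1],
                                                       x_chunks[i + 1],
                                                       async_op=True))
        handles[i].wait()
        partials.append(gathered[i] @ weight)

    # Concatenated partials are in chunk-major row order; restore the
    # rank-major order produced by an unpartitioned all-gather.
    out = torch.cat(partials, dim=0)
    n = out.shape[-1]
    return (out.view(num_chunks, world_size, -1, n)
               .transpose(0, 1).reshape(-1, n))
```

In this sketch, finer partitioning exposes more overlap but issues more, smaller collectives; choosing the chunk granularity (and the analogous group and primitive choices) is exactly the kind of trade-off the partition space and hierarchical scheduler are meant to navigate.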