Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization

Published: 01 Jan 2025, Last Modified: 09 May 2025, HPCA 2025, CC BY-SA 4.0
Abstract: The emergence of Large Language Models (LLMs) has necessitated the adoption of distributed training techniques, in which thousands of GPUs are deployed to train a single model. Unfortunately, the efficiency of large-scale distributed training systems is often suboptimal because of the increased likelihood of hardware errors in high-end GPU products and the heightened risk of network traffic collisions. Specifically, GPUs involved in the same job require periodic synchronization to exchange necessary data, such as gradients, parameters, or activations. As a result, any local hardware failure can disrupt the training task, and the inability to swiftly identify faulty components leads to a significant waste of GPU resources. Moreover, prolonged communication caused by traffic collisions can substantially increase GPU waiting times. To address these challenges, we propose a communication-driven solution called C4. The key insights of C4 are twofold. First, the load in distributed training is homogeneous and is divided into iterations by periodic synchronization, so hardware anomalies manifest as characteristic symptoms in collective communication. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding the resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, which involves a limited number of long-lived flows, allows C4 to execute traffic planning efficiently, substantially reducing bandwidth competition among these flows. C4 has been extensively deployed across real-world production systems at a hyperscale cloud provider, yielding a significant improvement in system efficiency, from 30% to 45%. This enhancement is attributed to a 30% reduction in error-induced overhead and a 15% reduction in communication costs.
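The first insight can be illustrated with a minimal sketch. It is not C4's actual detection logic, which the abstract does not specify; it only assumes that a hypothetical monitor records, for each iteration, how late each rank arrives at a synchronization point. Because the load is homogeneous, healthy ranks should arrive at similar times, so a rank that is repeatedly far behind the median is a candidate for isolation. The function name `find_suspect_ranks` and the parameters `ratio` and `min_fraction` are assumptions made for illustration.

```python
from statistics import median


def find_suspect_ranks(arrival_lag_ms, ratio=3.0, min_fraction=0.8):
    """Flag ranks that are consistently late at synchronization points.

    arrival_lag_ms: one list per iteration, holding each rank's lag (ms)
    behind the first rank to reach the collective (hypothetical input).
    ratio: how many times the per-iteration median counts as "slow".
    min_fraction: share of iterations a rank must be slow in to be flagged.
    """
    if not arrival_lag_ms:
        return []
    num_ranks = len(arrival_lag_ms[0])
    slow_counts = [0] * num_ranks
    for lags in arrival_lag_ms:
        # Homogeneous load: the median lag approximates a healthy rank.
        typical = max(median(lags), 1e-9)
        for rank, lag in enumerate(lags):
            if lag > ratio * typical:
                slow_counts[rank] += 1
    needed = min_fraction * len(arrival_lag_ms)
    return [rank for rank, count in enumerate(slow_counts) if count >= needed]


# Illustrative, made-up measurements: rank 3 lags in every iteration.
observed = [
    [1.0, 1.1, 0.9, 9.0],
    [1.2, 1.0, 1.1, 8.5],
    [0.9, 1.0, 1.0, 10.0],
]
suspects = find_suspect_ranks(observed)
```

Requiring a rank to be slow across most iterations, rather than in a single one, is a design choice in this sketch to avoid flagging transient jitter; a production system would need further signals to distinguish a faulty GPU from a congested link.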