Keywords: GPU, cluster health, prediction
TL;DR: We propose a ML based forecasting algorithm designed for predicting health status of GPU clusters based on GPU telemetry.
Abstract: Large-scale training jobs, especially those utilizing GPU clusters, are vulnerable to various failure modes, including individual hardware faults, network issues, and software-level problems. These failures can lead to significant downtime, wasted computational resources, and delays in research or production workflows. We propose a ML based forecasting algorithm designed for predicting health status of GPU clusters. Through extensive ablation studies, we found that cascading 1D CNNs achieved the best performance. The model leverages time-series data representing various cluster metrics, such as temperature, power consumption, and resource utilization towards predicting cluster failures, enabling proactive maintenance and resource optimization. By tuning differently per use-case, the model is able to achieve overall PRAUC of 0.90, and precision and recall of 0.99 and 0.90 respectively. This work is motivated by the need to improve the reliability and efficiency of large-scale training jobs that are susceptible to hardware and software failures.
Submission Number: 43
Loading