Forecasting machine degradation of GPU Clusters

Shengnan Cai; Shuxin Nie; Zhehui Chen; Nupur Gulalkari; George Vanica; Chetna Jain; Sethuraman Sankaran

Published: 30 Oct 2025, Last Modified: 04 Nov 2025, Venue: MLForSys 2025, License: CC BY 4.0

Keywords: GPU, cluster health, prediction

TL;DR: We propose an ML-based forecasting algorithm that predicts the health status of GPU clusters from GPU telemetry.

Abstract: Large-scale training jobs, especially those running on GPU clusters, are vulnerable to various failure modes, including individual hardware faults, network issues, and software-level problems. These failures can lead to significant downtime, wasted computational resources, and delays in research or production workflows. We propose an ML-based forecasting algorithm for predicting the health status of GPU clusters. Through extensive ablation studies, we found that cascading 1D CNNs achieved the best performance. The model uses time-series data representing various cluster metrics, such as temperature, power consumption, and resource utilization, to predict cluster failures, enabling proactive maintenance and resource optimization. When tuned separately for each use case, the model achieves an overall PRAUC of 0.90, with precision and recall of 0.99 and 0.90, respectively. This work is motivated by the need to improve the reliability and efficiency of large-scale training jobs that are susceptible to hardware and software failures.
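The metrics quoted in the abstract can be made concrete with a small sketch. The abstract does not say how PRAUC was computed or how thresholds were tuned, so the code below is only an illustration under assumptions: PRAUC is approximated by average precision (the step-wise area under the precision-recall curve), the function names are hypothetical, and the labels and scores are invented toy values, not data or results from the paper.

```python
# Illustrative only: toy labels/scores are made up, not taken from the paper.
# 1 = machine degraded within the forecast horizon, 0 = machine stayed healthy.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.95, 0.10, 0.80, 0.40, 0.35, 0.05, 0.70, 0.60]


def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as failures."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # nothing flagged: no false alarms
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(labels, scores):
    """Average precision, used here as an assumed stand-in for PRAUC.

    Assumes scores are distinct; tied scores would need to be grouped.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at the rank where recall increases
    return ap / total_pos


# A high threshold favours precision, a low one favours recall.
strict = precision_recall(labels, scores, threshold=0.65)
lenient = precision_recall(labels, scores, threshold=0.30)
pr_auc = average_precision(labels, scores)
```

On this toy data the stricter threshold yields no false alarms but misses one degraded machine, while the lenient threshold catches every degraded machine at the cost of two false alarms. This is one plausible reading of tuning per use case: the same scores can serve a precision-oriented or a recall-oriented operating point.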

Submission Number: 43
