Learning anomalies from graph: predicting compute node failures on HPC clusters

Published: 06 Nov 2024, Last Modified: 14 Nov 2024, NLDL 2025 Oral, CC BY 4.0
Keywords: Artificial Intelligence, Machine Learning, Graphs, HPC, Data Center, Anomalies Forecasting
TL;DR: We introduce and compare several models that leverage graph embeddings to predict compute node failures on HPC clusters.
Abstract: Today, high-performance computing (HPC) systems play a crucial role in advancing artificial intelligence. Nevertheless, the estimated global data center electricity consumption in 2022 was around 1% of the final global electricity demand. Therefore, as HPC systems advance towards Exascale computing, research is required to ensure their growth is sustainable and environmentally friendly. Data from infrastructure monitoring can be leveraged to predict downtimes, ensure they are handled in time, and increase overall system utilization. In this paper, we compare four machine-learning approaches, three of them based on graph embeddings, to predict compute node downtimes. The experiments were performed with data from Marconi 100, a tier-0 production supercomputer at CINECA in Bologna, Italy. Our results show that the machine-learning models can accurately predict downtime, surpassing current state-of-the-art models.
Submission Number: 51