LSTM-Based Unsupervised Anomaly Detection in High-Performance Computing: A Federated Learning Approach
Abstract: High-Performance Computing (HPC) systems are intricate machines that must run at maximum efficiency to justify their high cost and minimize their environmental impact. Anomalies that hinder the smooth operation of supercomputing nodes are therefore a significant issue in modern HPC systems, and the development of automated anomaly detection methods is a crucial area of research in the HPC domain. Machine Learning (ML) models have shown great success in identifying anomalies on individual nodes, especially as contemporary supercomputers are outfitted with advanced monitoring systems that provide large datasets for training. However, the potential to combine data from multiple nodes and to employ collective ML models remains largely unexplored. Federated Learning (FL) offers a promising approach by enabling individual models to share and learn from one another. Although FL has been employed in areas such as healthcare and IoT, its application to HPC is still novel. This study explores how FL can be leveraged to enhance anomaly detection in HPC systems. Using data from a real-world supercomputer, the approach shows significant promise, boosting the average F1-score from 0.307 to 0.815 and the average AUC from 0.368 to 0.77. Moreover, FL drastically reduces the time required to gather sufficient training data, allowing faster deployment of detection models: traditional ML models typically need about 4.5 months of data to perform effectively, whereas FL achieves the same with only 1.2 weeks, a 15-fold reduction in data requirements.
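The abstract describes FL as letting per-node models "share and learn from one another". A common way to realize this is federated averaging (FedAvg), where a server combines locally trained parameters into a global model. The sketch below is illustrative only: the abstract does not specify the aggregation scheme, and all names and values here are hypothetical.

```python
# Minimal FedAvg-style sketch (assumption: the paper's FL setup is not
# detailed in the abstract; this shows the generic aggregation step).
import numpy as np

def federated_average(node_weights, node_sample_counts):
    """Average flattened parameter vectors from several nodes,
    weighted by each node's local training-sample count."""
    total = sum(node_sample_counts)
    coeffs = np.array(node_sample_counts, dtype=float) / total
    stacked = np.stack(node_weights)          # shape: (n_nodes, n_params)
    return (coeffs[:, None] * stacked).sum(axis=0)

# Three hypothetical compute nodes, each with (flattened) LSTM parameters:
w = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
n = [10, 20, 70]                              # local sample counts
global_w = federated_average(w, n)            # → array([4.2, 5.2])
```

In a full system, `global_w` would be broadcast back to the nodes for the next round of local training, which is what allows nodes with little data to benefit from the fleet's collective history.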