Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers

Published: 2024, Last Modified: 15 Jan 2026 | ISSRE 2024 | CC BY-SA 4.0
Abstract: Alongside the high performance delivered by massive hardware, ever-larger computer systems generate large volumes of hardware alerts every day during reliability maintenance. Based on an exploratory study of a representative supercomputer system, this work first characterizes supercomputer alerts as an overload of continuous bursts for operators. Yet existing similarity-based aggregation solutions, tuned for in-band textual alerts, are myopic: they find dissimilar representatives instead of looking into alert semantics in the supercomputer context. To fill the void in supercomputer alert aggregation, we propose the SuperAgg framework, which extracts the hierarchical patterns of real-world alerts and uses them for online alert management. SuperAgg integrates unsupervised state detection on time series with expert analysis to discover 4 categories of sensor-tier alert patterns, and exploits primary-and-secondary statistics between sensors to derive system-tier correlation patterns. With this extracted knowledge, SuperAgg identifies the formulated patterns online and applies combined spatiotemporal strategies to reduce the alert influx. Evaluations on alerts generated from a production supercomputer show that SuperAgg provides an aggregation rate of over 98% and significantly higher aggregation accuracy (over 83.8% and 43.2% on different datasets) than 3 baselines. Production deployment further demonstrates its effectiveness from the perspective of system operators. The source code is available at: https://github.com/Txh-User/SuperAgg.