SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction

Published: 01 Jan 2024, Last Modified: 16 May 2025, WWW (Companion Volume) 2024, CC BY-SA 4.0
Abstract: Cloud failure prediction (e.g., disk, memory, and node failure prediction) is a crucial task for ensuring the reliability and performance of cloud systems. However, class imbalance poses a major challenge for accurate prediction, because the number of healthy components (majority class) in a cloud system is much larger than the number of failed components (minority class). The consequences of this imbalance include biased model performance and insufficient learning, as the model may lack adequate information to learn the characteristics associated with cloud failure effectively. Moreover, current methods for addressing class imbalance, such as SMOTE and its variants, have drawbacks, including generating noisy samples and struggling to maintain sample diversity, which limit their effectiveness in cloud failure prediction. In this paper, we propose a novel oversampling method for imbalanced classification, named SOIL (Score cOnditioned dIffusion modeL), which employs a score-conditioned diffusion model to generate high-quality synthetic samples for the minority class that more accurately represent real-world cloud failure patterns. By incorporating classification probabilities as conditional scores, SOIL supervises the generation process, effectively limiting noise production while maintaining sample diversity. In extensive experiments on various public and industrial datasets, adopting our method improves the F1-score of the cloud failure prediction model by an average of 5.39%, and SOIL consistently outperforms state-of-the-art competitors in addressing the class imbalance problem, which confirms its effectiveness and robustness. In addition, SOIL has been successfully applied to a global large-scale cloud platform serving billions of customers, demonstrating its practicality.