G-Cause: Parameter-free Global Diagnosis for Hyperscale Web Service Infrastructures

Xinrui Jiang, Yang Zhang, Tingzhu Bi, Xiangzhuang Shen, Yu Zhang, Yicheng Pan, Meng Ma, Linlin Han, Feng Wang, Xian Liu, Ping Wang

Published: 2024, Last Modified: 09 Feb 2025, ICWS 2024, CC BY-SA 4.0

Abstract: Hyperscale web service infrastructures are growing more complex and face a variety of threats, raising the demand for more sophisticated automated operations and diagnosis solutions. Existing anomaly root cause localization approaches often focus on service-level components without drilling down to the lower-level resources where services are deployed, which hinders fine-grained failure remediation. This paper introduces a challenging task called global diagnosis and addresses it with a technique called G-Cause, which is applicable to both service-level and host-level root cause analysis scenarios. G-Cause builds a highly adaptive diagnostic framework based on the frequency-domain and time-domain characteristics of monitoring metrics, allowing it to handle global diagnosis requirements from app to host with minimal parameter adjustments. We deploy and validate our approach in two typical scenarios: homogeneous metric diagnosis from app to microservice, and heterogeneous metric diagnosis for various host resources. The results demonstrate that G-Cause outperforms state-of-the-art diagnosis algorithms while providing strong interpretability. Our approach helps operators understand the core mechanism of anomaly propagation and adjust their management strategies more effectively. With these strengths, G-Cause successfully serves our global product operations and also contributes to many other workflows.