Abstract: Hyperscale web service infrastructures are becoming increasingly complex and face a variety of threats, raising the demand for more sophisticated automated operations and diagnosis solutions. Existing anomaly root cause localization approaches often focus on service-level components without drilling down to the lower-level resources where services are deployed, which hinders fine-grained failure remediation. This paper introduces a challenging task called global diagnosis and addresses it with a technique called G-Cause, which is applicable to both service-level and host-level root cause analysis scenarios. G-Cause builds a highly adaptive diagnostic framework based on the frequency-domain and time-domain characteristics of monitoring metrics, allowing it to handle global diagnosis requirements from application to host with minimal parameter adjustments. We deploy and validate our approach in two typical scenarios: homogeneous metric diagnosis from application to microservice, and heterogeneous metric diagnosis for various host resources. The results demonstrate that G-Cause outperforms state-of-the-art diagnosis algorithms while providing strong interpretability. Our approach helps operators understand the core mechanism of anomaly propagation and adjust their management strategies more effectively. With these strengths, G-Cause successfully supports our global product operations and also contributes to many other workflows.