Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data

Published: 2024, Last Modified: 06 Feb 2025. Venue: IEEE Trans. Serv. Comput. 2024. License: CC BY-SA 4.0
Abstract: Microservices are widely adopted in large IT enterprises, leveraging the scalability, resiliency, and elasticity of the cloud-native architecture. Effective root cause analysis is crucial for ensuring the reliability of such cloud-native systems. Many efforts have focused on using the three modalities of observability data: traces, metrics, and logs. However, existing approaches are limited by inconsistent problem definitions and by cloud-native heterogeneity. To address these challenges, we propose HolisticRCA, a framework for root cause analysis in cloud-native systems from a holistic perspective. HolisticRCA formally defines root cause analysis through three dimensions. It then uses an “assembling building blocks” strategy to address cloud-native heterogeneity: it maps each observability feature into a shared vector space and concatenates the vector embeddings associated with each resource entity to obtain standardized resource entity embeddings. Next, it applies a Graph Attention Network to capture the intertwined relations among resource entities and incorporates mask embeddings to enable holistic analysis. Evaluation results on three public datasets show that HolisticRCA outperforms existing approaches in holistic root cause analysis of cloud-native systems.
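To make the “assembling building blocks” idea concrete, the following is a minimal sketch and not the HolisticRCA implementation. It projects each modality's features into a shared width, concatenates the per-modality embeddings for every resource entity, and runs one single-head graph attention layer over the entity graph. Everything specific here is an assumption for illustration: the entity names, the raw feature widths, the random untrained weights, and in particular the role of the mask embedding, which is assumed to stand in for a modality that an entity does not emit (the abstract does not say how masks are used).

```python
import math
import random

random.seed(0)
D = 4                                    # shared embedding width (assumed)
MODALITIES = ("metric", "trace", "log")
RAW_DIM = {"metric": 3, "trace": 2, "log": 5}   # hypothetical raw feature widths


def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# One projection per modality into the shared D-dimensional space.
PROJ = {m: rand_matrix(D, RAW_DIM[m]) for m in MODALITIES}
# Assumed role of mask embeddings: placeholder for a missing modality.
MASK = {m: [0.1] * D for m in MODALITIES}


def entity_embedding(features):
    """Concatenate per-modality embeddings into one fixed-width entity vector."""
    out = []
    for m in MODALITIES:
        raw = features.get(m)
        out.extend(matvec(PROJ[m], raw) if raw is not None else MASK[m])
    return out


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(h, neighbors, W, a):
    """Single-head graph attention: softmax over LeakyReLU(a . [Wh_i || Wh_j])."""
    z = {n: matvec(W, v) for n, v in h.items()}
    out = {}
    for i in h:
        nbrs = [i] + list(neighbors.get(i, []))      # include a self-loop
        scores = [leaky_relu(sum(w * x for w, x in zip(a, z[i] + z[j])))
                  for j in nbrs]
        top = max(scores)                            # stabilise the softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out[i] = [sum((e / total) * z[j][k] for e, j in zip(exps, nbrs))
                  for k in range(len(z[i]))]
    return out


# Hypothetical heterogeneous entities: each emits a different subset of modalities.
entities = {
    "node-1": {"metric": [0.9, 0.1, 0.3]},
    "pod-a": {"metric": [0.2, 0.8, 0.5], "log": [1, 0, 0, 2, 0]},
    "svc-x": {"trace": [120.0, 0.02], "log": [0, 3, 0, 0, 1]},
}
neighbors = {"node-1": ["pod-a"], "pod-a": ["node-1", "svc-x"], "svc-x": ["pod-a"]}

h = {name: entity_embedding(f) for name, f in entities.items()}
OUT = 6
W = rand_matrix(OUT, D * len(MODALITIES))
a = [random.uniform(-0.5, 0.5) for _ in range(2 * OUT)]
h2 = gat_layer(h, neighbors, W, a)
```

Because every entity ends up with a vector of the same width regardless of which modalities it produces, nodes, pods, and services can share one graph layer. A real system would learn the projections, attention parameters, and mask embeddings jointly rather than sampling them at random.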