Identifying Failure Root Causes for Cloud-Native Microservice Applications

Raphael Rouf, Farhoud Jafari Kaleibar, Marin Litoiu, Mohammadreza Rasolroveicy, Seema Nagar, Prateeti Mohapatra, Pranjal Gupta, Ian Watts

Published: 2025, Last Modified: 25 Jan 2026 | ACSOS-C 2025 | CC BY-SA 4.0

Abstract: Cloud-native microservice applications depend on reliable platforms to ensure stable performance, even under resource overload faults. However, understanding the root causes of system failures holistically remains a significant challenge. This paper proposes a novel, root cause-oriented framework that supports autonomic, self-managing systems with humans in the loop. Our approach leverages three modalities of observability data (logs, metrics, and traces) to build a multi-perspective view of system behavior. We enhance preprocessing to extract metric anomaly scores and log semantics (e.g., Template ID counts and Golden Signal counts), which are then fused to train a GNN-GRU model. This model captures spatial and temporal patterns across services to classify failure types and identify the root causes behind them. The resulting root cause predictions, including correlated anomalies and their associated source and target services, are analyzed to provide context-rich insights, aiding human operators (e.g., SREs) in debugging and diagnosis. Our framework fits naturally into the Monitor-Analyze-Plan-Execute (MAPE) loop, enabling proactive fault management and feedback-driven improvement. Evaluations using the public MicroSS dataset, which comprises faults such as resource saturation and configuration errors, demonstrate the effectiveness of our method in accurately identifying failure origins and supporting operational resilience.
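
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain z-score as the metric anomaly score, integer template IDs for parsed log lines, and a single mean-aggregation step over the service call graph as a stand-in for the spatial (GNN) part. The temporal (GRU) part is omitted, and all function and service names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def anomaly_score(history, current):
    """Absolute z-score of `current` against a window of past values.

    Assumption: a z-score stands in for whatever anomaly scoring the
    paper actually uses.
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


def fuse_features(history, current, template_ids, n_templates):
    """Concatenate one metric anomaly score with per-template log counts."""
    counts = Counter(template_ids)
    log_counts = [counts.get(t, 0) for t in range(n_templates)]
    return [anomaly_score(history, current)] + log_counts


def aggregate_neighbors(features, edges):
    """One mean-aggregation step over a directed service call graph.

    Each node's new vector is the mean of its own vector and the vectors
    of the services that call it (edges are (source, target) pairs).
    """
    incoming = {node: [node] for node in features}
    for source, target in edges:
        incoming[target].append(source)
    aggregated = {}
    for node, members in incoming.items():
        vectors = [features[m] for m in members]
        aggregated[node] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return aggregated


# Hypothetical two-service window: "frontend" calls "cart".
features = {
    "frontend": fuse_features([10, 12, 10, 12], 11, [0, 0, 1], n_templates=2),
    "cart": fuse_features([10, 12, 10, 12], 15, [1, 1, 1], n_templates=2),
}
smoothed = aggregate_neighbors(features, [("frontend", "cart")])
```

In this toy window, "cart" carries the high anomaly score, and aggregation mixes in the features of its caller. A trained model would replace the fixed mean with learned weights and feed the per-window outputs to a recurrent layer.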

External IDs: dblp:conf/acsos/RoufKLRNMGW25