Decentralized Root Cause Analysis for Cloud-Native Microservices - Experience with Distributed PageRank in Production

Published: 22 Jun 2026, Last Modified: 06 May 2026 | The 56th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) Conference 2026 | Everyone | CC BY 4.0
Abstract: Root cause analysis in microservice systems typically relies on centralized telemetry aggregation, which becomes a bottleneck as deployments scale. We present an architecture that removes the central component entirely: each microservice runs a sidecar diagnostic agent that computes a local anomaly score and participates in gossip-based distributed Personalized PageRank over the service dependency graph. On the RCAEval benchmark (735 failure cases, 12–64 services), Top-1 accuracy is 94.7%, statistically indistinguishable from centralized PageRank at 94.1%, with roughly a third lower diagnosis latency and 74.5% less network traffic. We deployed the system for four weeks on a 15-node AWS EKS cluster handling 1% of live e-commerce traffic. The deployment processed 112 injected faults (94.6% Top-1) and 3 organic incidents (1 of 3 correct). The production experience exposed problems that benchmarks miss: cross-availability-zone gossip asymmetry causing split-brain diagnoses, inadequate global anomaly thresholds, and external-dependency blind spots. We also ran into operational issues we had not anticipated, such as log volume from agents overwhelming the existing logging pipeline.
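The abstract does not spell out the gossip protocol, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes synchronous rounds, edges that point from caller to callee, and a self-loop on services with no outgoing dependencies; the function name `distributed_ppr`, the service names, and the anomaly values are all hypothetical. Each simulated agent knows only its own score and its direct dependencies, and exchanges rank mass with those neighbors instead of reporting to a central aggregator.

```python
from collections import defaultdict


def distributed_ppr(out_edges, anomaly, damping=0.85, rounds=50):
    """Simulate message-passing Personalized PageRank.

    out_edges: dict mapping every service to the list of services it calls
               (every service must appear as a key, possibly with an empty list).
    anomaly:   dict of non-negative local anomaly scores, used as the
               personalization (restart) distribution.
    """
    nodes = list(out_edges)
    total = sum(anomaly.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one service needs a positive anomaly score")
    prior = {n: anomaly.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)

    for _ in range(rounds):
        inbox = defaultdict(float)
        # Each agent splits its current rank evenly among its dependencies.
        # Assumption: an agent with no dependencies keeps its mass (self-loop),
        # so no agent ever needs a global view of the graph.
        for n in nodes:
            targets = out_edges[n] or [n]
            share = rank[n] / len(targets)
            for t in targets:
                inbox[t] += share
        # Each agent combines received mass with its own local anomaly prior.
        rank = {n: (1 - damping) * prior[n] + damping * inbox[n] for n in nodes}
    return rank


# Hypothetical four-service dependency graph and anomaly scores.
graph = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["db"],
    "db": [],
}
local_scores = {"frontend": 0.2, "cart": 0.3, "catalog": 0.1, "db": 0.4}
scores = distributed_ppr(graph, local_scores)
suspect = max(scores, key=scores.get)  # highest-ranked root cause candidate
```

In this toy graph the shared dependency accumulates most of the rank mass, which is the behavior a PageRank-style ranking is meant to produce. A real deployment would replace the synchronous loop with asynchronous gossip between sidecars and would have to cope with the message loss and asymmetry the abstract mentions.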