Causal Modeling based Fault Localization in Cloud Systems using Golden Signals

Pooja Aggarwal, Seema Nagar, Ajay Gupta, Larisa Shwartz, Prateeti Mohapatra, Qing Wang, Amit M. Paradkar, Atri Mandal

Published: 01 Jan 2021, Last Modified: 03 Oct 2023, CLOUD 2021

Abstract: In cloud-native applications, a large fraction of operational failures, known as outages, result in violations of Service Level Objectives (SLOs). SLOs are defined around specific measurable characteristics: availability, throughput, frequency, response time, and quality. Four metrics (latency, traffic, errors, and saturation) cover most outages of an application; these are often called golden signals. The dynamic nature and complexity of cloud-native applications complicate Site Reliability Engineers’ (SREs) efforts in problem determination, particularly in fault localization. Fault localization is often a trial-and-error process in which SREs rely on their domain knowledge and experience. It is laborious and frequently results in a long Mean Time To Resolution (MTTR) for outages. This paper describes a lightweight fault localization system that establishes causal relationships among the golden signal service errors and error logs, and then leverages PageRank centrality on the derived causal graph to generate a ranked list of faulty microservices.
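To illustrate the final ranking step, the following is a minimal sketch of PageRank over a small causal graph, not the authors' implementation. It makes several assumptions that the abstract does not state: the service names and edges are hypothetical, each edge points from an affected service to its suspected cause (so score accumulates at likely root causes), and the damping factor of 0.85 is the conventional default rather than a value from the paper. The causal-discovery step that would produce the edges is not shown.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-9):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges spread their score uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def rank_services(edges):
    """Return services ordered from most to least suspicious."""
    scores = pagerank(edges)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical causal graph: (affected service, suspected cause).
causal_edges = [
    ("frontend", "checkout"),
    ("frontend", "catalog"),
    ("checkout", "payment"),
    ("catalog", "database"),
    ("payment", "database"),
]

ranking = rank_services(causal_edges)
print(ranking)
```

In this toy graph the service that explains errors in the most other services receives the highest score. With the opposite edge orientation (cause to effect), one would instead run PageRank on the reversed graph.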
