The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier

Published: 01 Jan 2024, Last Modified: 06 Feb 2025, ASE 2024, CC BY-SA 4.0
Abstract: Failure root cause analysis (RCA), which systematically identifies underlying faults, is essential for ensuring the reliability of widely adopted microservice-based applications and cloud-native systems. However, manual analysis based on simple rules is burdensome because resource entities are heterogeneous and the volume of observability data is massive. Furthermore, existing approaches to automating RCA struggle to perform in-depth fault analysis without extensive fault labels. To address the scarcity of fault labels, we examine an extreme RCA scenario in which each fault type has only one example (one-shot). We propose LasRCA, a framework for one-shot RCA in cloud-native systems that leverages the collaboration of a large language model (LLM) and a small classifier. In the training stage, LasRCA first trains the small classifier on the one-shot fault examples. The small classifier then iteratively selects high-confusion samples and receives feedback on their fault types from LLM-driven fault labeling. These samples are used to retrain the small classifier. In the inference stage, LasRCA performs joint RCA through the collaboration of the LLM and the small classifier, achieving a trade-off between effectiveness and cost. Experimental results on public datasets with heterogeneous resource entities and prevalent fault types show the effectiveness of LasRCA in one-shot RCA.
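
The training and inference loop described in the abstract can be illustrated with a small sketch. The abstract does not specify the classifier, the confusion measure, or how the LLM is prompted, so everything below is an assumption made for illustration: a nearest-centroid model stands in for the small classifier, prediction entropy stands in for "confusion", `llm_label` is a hypothetical callback standing in for LLM-driven fault labeling (stubbed here with a simple rule), and the confidence threshold used at inference is one possible way to trade effectiveness against LLM cost, not necessarily the one LasRCA uses.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Model = Dict[str, List[float]]  # fault type -> centroid (assumed classifier form)


def centroids(examples: List[Tuple[Vector, str]]) -> Model:
    """Fit the stand-in small classifier: one mean vector per fault type."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, label in examples:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict_proba(model: Model, x: Vector) -> Dict[str, float]:
    """Softmax over negative Euclidean distances to each centroid."""
    scores = {lab: -math.dist(x, c) for lab, c in model.items()}
    top = max(scores.values())
    exps = {lab: math.exp(s - top) for lab, s in scores.items()}
    total = sum(exps.values())
    return {lab: e / total for lab, e in exps.items()}


def entropy(probs: Dict[str, float]) -> float:
    """Assumed confusion measure: higher entropy means a more confused model."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def train_with_llm_feedback(
    one_shot: List[Tuple[Vector, str]],
    unlabeled: List[Vector],
    llm_label: Callable[[Vector], str],  # hypothetical LLM labeling hook
    rounds: int = 2,
    per_round: int = 2,
) -> Model:
    """Start from one example per fault type, then retrain on LLM-labeled samples."""
    labeled = list(one_shot)
    pool = list(unlabeled)
    model = centroids(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Most confusing samples first.
        pool.sort(key=lambda x: entropy(predict_proba(model, x)), reverse=True)
        picked, pool = pool[:per_round], pool[per_round:]
        labeled += [(x, llm_label(x)) for x in picked]
        model = centroids(labeled)
    return model


def joint_rca(
    model: Model, x: Vector, llm_label: Callable[[Vector], str], threshold: float = 0.8
) -> str:
    """Use the cheap classifier when it is confident; otherwise ask the LLM."""
    probs = predict_proba(model, x)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else llm_label(x)


# Toy data: two made-up fault types in a 2-D feature space.
def stub_llm(x: Vector) -> str:
    return "cpu" if sum(x) < 4 else "network"


one_shot = [((0.0, 0.0), "cpu"), ((4.0, 4.0), "network")]
unlabeled = [(1.8, 1.9), (2.3, 2.2), (0.2, 0.1), (3.9, 4.1)]
model = train_with_llm_feedback(one_shot, unlabeled, stub_llm)
```

In this sketch the LLM is consulted only for samples the classifier finds ambiguous, both when growing the training set and at inference, which is one simple way to keep the number of LLM calls small.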