Abstract: Microservice architecture is used by leading companies to develop their large-scale software systems. These systems comprise numerous nodes, diverse service types and instances, and substantial volumes of data. Current research usually requires a central node to collect massive amounts of data from the system to build an anomaly detection model, which leads to two significant limitations: 1) Most research trains a single model for the entire system, ignoring the unique characteristics of individual nodes. Additionally, processing vast system-wide data on a single node imposes significant resource demands. 2) Microservice systems change frequently, so the historical data distribution differs significantly from the real data distribution, resulting in concept drift. To address these limitations, we propose DistriAD, a distributed anomaly detection method specifically designed for large-scale microservice systems. DistriAD deploys a lightweight anomaly detection model on each distributed node for precise, node-level anomaly detection, thus improving detection accuracy. Furthermore, DistriAD uses a federated learning framework and a continuous updating method that incorporates human feedback to update model parameters and address concept drift. Experimental validation on public datasets, e.g., TrainTicket-based and GAIA, and on a proprietary test system dataset demonstrates that DistriAD outperforms baseline methods, improving the F1-score by up to 39.2%. We believe that this work can provide insights into distributed anomaly detection in large-scale microservice systems, thereby improving their performance.
External IDs: dblp:conf/icws/LiLZWWBLGMP25