Abstract: Many large online systems are now built on a microservice architecture. Because of the complex dependencies among services, the failure of a service in such a system can cause an avalanche, which directly affects user experience and the company’s revenue. It is therefore critical for service operators to build anomaly detection services that monitor online systems closely and comprehensively. Although a large number of anomaly detection approaches have been proposed, few of them can simultaneously adapt to the practical detection requirements of hundreds of operators. To tackle this problem, we propose LSTM-AAD, a Bayesian LSTM-based active anomaly detection service. LSTM-AAD extracts anomaly features based on the common patterns among metrics, introduces a Bayesian LSTM model to detect anomalies in time series metrics, and employs active learning to update the online model using a small number of uncertain feedback samples. In addition, the proposed user-oriented service can respond quickly to operators’ further requirements. We conduct extensive experiments on real time series metrics from large online services at Tencent. The results indicate that LSTM-AAD significantly outperforms other state-of-the-art methods. Moreover, our approach detects anomalies efficiently out of the box, which makes it suitable for large-scale systems.
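To make the active learning step concrete, the following is a minimal sketch of uncertainty-based sample selection. It is not the LSTM-AAD implementation: the abstract does not specify the selection criterion, so the sketch assumes that uncertainty is measured as the standard deviation of several stochastic predictions per time point. The function name `select_uncertain` and the example scores are hypothetical.

```python
from statistics import pstdev

def select_uncertain(mc_predictions, k):
    """Return indices of the k points whose sampled predictions disagree most.

    mc_predictions[i] holds several stochastic predictions for time point i,
    e.g. anomaly scores from repeated forward passes of a Bayesian model.
    (Assumed input format; the paper's actual interface is not given.)
    """
    # The spread of the sampled predictions serves as the uncertainty estimate.
    uncertainty = [pstdev(samples) for samples in mc_predictions]
    # Rank points from most to least uncertain; ties keep their original order.
    ranked = sorted(range(len(uncertainty)), key=lambda i: -uncertainty[i])
    return ranked[:k]

# Hypothetical anomaly scores: 4 time points, 3 sampled predictions each.
scores = [
    [0.10, 0.12, 0.11],  # consistently low scores
    [0.40, 0.90, 0.15],  # samples disagree strongly
    [0.95, 0.97, 0.96],  # consistently high scores
    [0.30, 0.60, 0.45],  # moderate disagreement
]
to_label = select_uncertain(scores, k=2)
print(to_label)  # [1, 3]
```

In a setup like this, only the selected points would be sent to operators for feedback, which keeps the labeling effort small while targeting the cases where the model is least sure.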