Learning a Hierarchical Monitoring System for Detecting and Diagnosing Service IssuesOpen Website

2015 (modified: 16 Jul 2019)KDD 2015Readers: Everyone
Abstract: We propose a machine learning based framework for building a hierarchical monitoring system to detect and diagnose service issues. We demonstrate its use for building a monitoring system for a distributed data storage and computing service consisting of tens of thousands of machines. Our solution has been deployed in production as an end-to-end system, starting from telemetry data collection from individual machines, to a visualization tool for service operators to examine the detection outputs. Evaluation results are presented on detecting 19 customer impacting issues in the past three months.
0 Replies

Loading