Abstract: In a service system, operations engineers generally deploy numerous monitoring mechanisms in system components to detect anomalies caused by system faults and to generate alerts, also known as alarms, that record those anomalies. Because of the topological relationships between system components, a fault in one component may affect other components, causing various local anomalies and generating multiple alerts across different components. Therefore, to facilitate troubleshooting, alerts arising from the same system fault are usually correlated into one group, called an alert incident. However, although existing approaches can automatically correlate alerts for operations engineers, analyzing alert incidents still relies on manual work. In this paper, we propose an approach called VOCE (Virtual On-Call Engineer). Using the emerging capabilities of a large language model, VOCE can automatically comprehend the anomaly information described by alerts and emulate the process by which operations engineers analyze an alert incident. Extensive experiments conducted on real alert incidents with two popular large language models demonstrate the effectiveness and efficiency of VOCE in automatically analyzing alert incidents.
External IDs: dblp:conf/fase/ChenCSWW25