Abstract: We discuss the potential improvements large language models (LLMs) can bring to incident management and how they can overhaul the way operators handle incidents today. We propose a holistic framework for building an AI helper for incident management and discuss several avenues of future research needed to achieve it. We thoroughly analyze the fundamental requirements the community should consider when designing such helpers. Our work is based on discussions with operators of a large public cloud provider and on their prior experience, both with incident management itself and with attempts to improve it through various forms of automation.