Abstract: Modern enterprise systems exhibit complex interdependen-cies
that make observability and incident response increas-ingly
challenging. Manual alert triage, which typically in-volves log
inspection, API verification, and cross-referencing operational
knowledge bases, remains a major bottleneck in reducing mean
recovery time (MTTR). This paper presents an agentic
observability framework deployed within Adobe’s e-commerce
infrastructure that autonomously performs alert triage using a
ReAct paradigm. Upon alert detection, the agent dynamically
identifies t he a ffected s ervice, retrieves and analyzes correlated
logs across distributed systems, and plans context-dependent
actions such as handbook consulta-tion, runbook execution, or
retrieval-augmented analysis of recently deployed code.
Empirical results from production deployment indicate a 90%
reduction in mean time to insight compared to manual triage,
while maintaining comparable di-agnostic accuracy. Our results
show that agentic AI enables an order-of-magnitude reduction
in triage latency and a step-change in resolution accuracy,
marking a pivotal shift toward autonomous observability in
enterprise operations.
Loading