Keywords: AI, Observability, Artificial Intelligence, Agentic Observability
TL;DR: We benchmark agent observability: OpenTelemetry misses 57% of agent faults, while our 9 agent-specific span kinds catch all 14 fault types. On SWE-bench Lite, 75% of agent failures are reasoning loops, which are invisible without agent telemetry. Open-source toolkit included.
Abstract: LLM-based autonomous agents fail in ways that existing observability infrastructure cannot detect. OpenTelemetry's GenAI semantic conventions cover LLM invocation and tool execution but leave five critical agent orchestration phases (planning, reasoning, safety monitoring, inter-agent delegation, and memory management) without span-level representation. We present AgentTelemetry, an open-source benchmark suite and toolkit for evaluating fault detection in agent systems. It comprises (1) a taxonomy of 14 fault types mapped to 9 agent-specific span kinds, (2) a controlled evaluation harness of 2,940 configurations (14 faults × 5 observability conditions × 7 frameworks × 6 models), and (3) a pip-installable library (3,700+ LOC, 78 tests) with adapters for seven frameworks. On the controlled benchmark, the full span taxonomy achieves a Fault Detection Rate (FDR) of 1.000 (an upper bound confirming structural completeness), compared to 0.429 for both vanilla OpenTelemetry and OTel+GenAI. An ablation study shows that all nine span kinds are necessary: removing any one makes at least one fault type undetectable. A case study on 112 SWE-bench Lite instances reveals that reasoning loops, a failure mode invisible to vanilla OTel, account for 75% of agent failures (95% CI: [66%, 82%]), and that a telemetry-guided intervention improves the patch rate by 8.3 pp over a matched control. All code, data, and benchmark configurations are open-source for reproducibility.
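The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and does not use the AgentTelemetry library: the labels (`fault_1`, `condition_1`, and so on) are placeholders, and it assumes that FDR is the fraction of the 14 fault types that are detected. Under that assumption, the reported baseline of 0.429 corresponds to 6 of 14 fault types, which is an inference from the rounded figure rather than a count stated in the abstract.

```python
from itertools import product

# Grid dimensions as stated in the abstract; the labels are placeholders,
# not the benchmark's real fault, framework, or model names.
faults = [f"fault_{i}" for i in range(1, 15)]         # 14 fault types
conditions = [f"condition_{i}" for i in range(1, 6)]  # 5 observability conditions
frameworks = [f"framework_{i}" for i in range(1, 8)]  # 7 frameworks
models = [f"model_{i}" for i in range(1, 7)]          # 6 models

# Full factorial grid: one configuration per combination.
configurations = list(product(faults, conditions, frameworks, models))


def fault_detection_rate(detected, all_faults):
    """Fraction of fault types detected (assumed definition of FDR)."""
    all_faults = set(all_faults)
    return len(set(detected) & all_faults) / len(all_faults)


# Full taxonomy: every fault type detected.
full_fdr = fault_detection_rate(faults, faults)

# Assumption: the baseline detects 6 of the 14 fault types (6/14 ≈ 0.429).
baseline_fdr = fault_detection_rate(faults[:6], faults)

print(len(configurations))          # 2940
print(round(full_fdr, 3))           # 1.0
print(round(baseline_fdr, 3))       # 0.429
print(round(1 - baseline_fdr, 2))   # 0.57
```

The last value is the share of fault types the baseline misses, which matches the 57% figure quoted in the TL;DR.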
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 9