Executable Ground Truth: A Closed-Loop Benchmark for Evaluating LLM Agents on Microservice Incident Remediation

Published: 25 May 2026, Last Modified: 25 May 2026 | CTB@ICML 2026 | CC BY 4.0
Keywords: LLM agents, executable ground truth, incident remediation, microservices, TraceRoot, benchmarking, tool-use agents, root cause analysis, non-termination, reasoning vs non-reasoning, closed-loop evaluation, system verification, failure modes, code patching, agent evaluation
TL;DR: TraceRoot evaluates agents via executable fixes, not reports. Reasoning models solve all incidents, while non-reasoning models mostly fail through non-termination or by making no code edits, failures that LLM-judge benchmarks hide.
Abstract: Most LLM agent benchmarks score the agent’s output description, not its effect on the world. We introduce TraceRoot, a microservice incident benchmark where an agent investigates structured logs, edits the faulty source file, and has its patch verified by an end-to-end reproducer with no LLM judge. Across six frontier models and five incidents, three reasoning-enabled models achieve 5/5 pass rates; three non-reasoning models score 1/5, failing primarily through non-termination, a failure mode invisible to report-based evaluators. We release all artifacts to support reproducible evaluation.
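The closed-loop scoring described in the abstract can be illustrated with a minimal sketch. This is not TraceRoot's actual harness: the names `Episode`, `Outcome`, `evaluate`, and `toy_reproducer`, the toy `parse_port` bug, and the choice to score a non-terminating run as a failure before inspecting any patch are all assumptions made for illustration. The point it shows is that the verdict comes from running a reproducer against the patched source, so an agent that never stops or never edits the file cannot pass by describing a fix.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    """Hypothetical outcome labels; the paper's own taxonomy may differ."""
    PASS = "pass"
    NO_EDIT = "no_edit"
    NON_TERMINATION = "non_termination"
    REPRODUCER_FAILED = "reproducer_failed"


@dataclass
class Episode:
    """Hypothetical record of one agent run on one incident."""
    terminated: bool                # did the agent stop within its budget?
    patched_source: Optional[str]   # edited file contents, or None if untouched


def evaluate(episode: Episode, original_source: str,
             reproducer: Callable[[str], bool]) -> Outcome:
    """Score a run by executing the reproducer, never by reading a report."""
    # Assumption: a run that exhausts its budget fails regardless of edits.
    if not episode.terminated:
        return Outcome.NON_TERMINATION
    if episode.patched_source is None or episode.patched_source == original_source:
        return Outcome.NO_EDIT
    if reproducer(episode.patched_source):
        return Outcome.PASS
    return Outcome.REPRODUCER_FAILED


# Toy incident: a config parser returns the port as a string instead of an int.
BUGGY = "def parse_port(value):\n    return value\n"
FIXED = "def parse_port(value):\n    return int(value)\n"
WRONG = "def parse_port(value):\n    return str(value)\n"


def toy_reproducer(source: str) -> bool:
    """Load the patched source and check the behavior the incident broke."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        return namespace["parse_port"]("8080") == 8080
    except Exception:
        return False


if __name__ == "__main__":
    for label, episode in [
        ("fixed", Episode(True, FIXED)),
        ("wrong patch", Episode(True, WRONG)),
        ("no edit", Episode(True, None)),
        ("never stopped", Episode(False, None)),
    ]:
        print(label, evaluate(episode, BUGGY, toy_reproducer).value)
```

In this sketch a report-only evaluator would have nothing to score in the last two cases, whereas the executable check assigns each a distinct failure label.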
Paper Type: Short (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 72