Executable Ground Truth: A Closed-Loop Benchmark for Evaluating LLM Agents on Microservice Incident Remediation
Keywords: LLM agents, executable ground truth, incident remediation, microservices, TraceRoot, benchmarking, tool-use agents, root cause analysis, non-termination, reasoning vs non-reasoning, closed-loop evaluation, system verification, failure modes, code patching, agent evaluation
TL;DR: TraceRoot evaluates agents by executing their fixes rather than judging their reports. Reasoning models solve all five incidents, while non-reasoning models mostly fail through non-termination or by never editing code, failures that LLM-judge benchmarks do not surface.
Abstract: Most LLM agent benchmarks score the agent’s output description, not its effect on the world. We introduce TraceRoot, a microservice incident benchmark where an agent investigates structured logs, edits the faulty source file, and has its patch verified by an end-to-end reproducer with no LLM judge. Across six frontier models and five incidents, three reasoning-enabled models achieve 5/5 pass rates; three non-reasoning models score 1/5, failing primarily through non-termination, a failure mode invisible to report-based evaluators. We release all artifacts to support reproducible evaluation.
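The closed loop described in the abstract can be illustrated with a minimal sketch. This is not the TraceRoot harness, which is not shown on this page. The function names (`evaluate`, `agent_step`), the step budget, and the outcome labels are hypothetical and chosen only for illustration. The sketch assumes an agent is a callable that acts on one source file per step and returns `True` when it declares itself done. The run is then classified by execution alone: did the agent stop, did it change the file, and does a reproducer command exit with status 0.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable, List


def file_digest(path: Path) -> str:
    """Return a SHA-256 digest of the file so edits can be detected."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def evaluate(
    agent_step: Callable[[Path], bool],  # hypothetical agent interface
    source: Path,
    reproducer_cmd: List[str],
    max_steps: int = 20,  # assumed step budget, not taken from the paper
) -> str:
    """Classify one agent run using only executable evidence (no LLM judge)."""
    before = file_digest(source)

    # Give the agent a bounded number of steps; it signals completion with True.
    finished = False
    for _ in range(max_steps):
        if agent_step(source):
            finished = True
            break
    if not finished:
        return "non_termination"

    # An agent that stops without touching the file produced no patch.
    if file_digest(source) == before:
        return "no_edit"

    # The reproducer's exit code is the ground truth for the patch.
    result = subprocess.run(reproducer_cmd, capture_output=True, timeout=60)
    return "pass" if result.returncode == 0 else "reproducer_failed"
```

Under these assumptions, an agent that writes a fluent diagnosis but never edits the file is scored `no_edit`, and one that keeps calling tools until the budget runs out is scored `non_termination`. An evaluator that only reads a final report cannot distinguish either outcome from a real fix.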
Paper Type: Short (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 72