Compositional Investigation: Why Reasoning Enables Tool-Using Agents to Fix What They Diagnose
Keywords: compositional learning, tool-using agents, root cause analysis, chain-of-thought reasoning, benchmark, incident remediation
TL;DR: On a closed-loop microservice bug-fix benchmark, reasoning models pass 5/5 incidents while non-reasoning models pass 1/5. Chain-of-thought acts as a compositional controller, not just an accuracy boost: the same model with reasoning on versus off shows a 5x difference in pass rate.
Abstract: We introduce rca llms, a closed-loop benchmark in which an agent must diagnose and patch microservice bugs by composing four sub-skills (evidence gathering, cross-service correlation, code localization, and code editing), with correctness verified by a deterministic reproducer rather than an LLM judge. Across six frontier models and five multi-hop incidents, all three reasoning variants achieve 5/5 pass rates while all three non-reasoning variants score 1/5, with most failures being non-terminations. The same base model (grok-4-1-fast) scores 5/5 with reasoning and 1/5 without, isolating chain-of-thought as a compositional controller that commits to sub-conclusions and advances the pipeline, rather than a simple accuracy booster.
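The closed-loop setup described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the names `Incident`, `Outcome`, `run_incident`, `pass_rate`, and the `max_steps` budget are all hypothetical, and the agent is reduced to a callable that either returns a patch or signals that it is still working. The sketch shows only how a deterministic reproducer (rather than an LLM judge) can decide pass or fail, and how running out of a step budget can be recorded as a non-termination.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NON_TERMINATION = "non_termination"


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    name: str
    # Deterministic reproducer: True when the bug no longer reproduces
    # after the patch is applied. No LLM judge is involved.
    reproducer: Callable[[str], bool]


# Hypothetical agent interface: given the incident name and the step index,
# return a patch to submit, or None if the agent is still investigating.
AgentStep = Callable[[str, int], Optional[str]]


def run_incident(incident: Incident, agent_step: AgentStep,
                 max_steps: int = 20) -> Outcome:
    """Run one incident until the agent submits a patch or the budget ends."""
    for step in range(max_steps):
        patch = agent_step(incident.name, step)
        if patch is None:
            continue  # agent has not committed to a fix yet
        # The first submitted patch is checked by the reproducer.
        return Outcome.PASS if incident.reproducer(patch) else Outcome.FAIL
    # The agent never committed to a patch within the step budget.
    return Outcome.NON_TERMINATION


def pass_rate(outcomes: List[Outcome]) -> str:
    """Format outcomes as 'passed/total', the notation used in the abstract."""
    passed = sum(1 for o in outcomes if o is Outcome.PASS)
    return f"{passed}/{len(outcomes)}"


if __name__ == "__main__":
    # Toy data for illustration only; these are not the paper's incidents.
    toy = Incident("toy-timeout", reproducer=lambda patch: "retry" in patch)

    def committing_agent(name: str, step: int) -> Optional[str]:
        # Gathers evidence for three steps, then commits to a patch.
        return "add retry with backoff" if step >= 3 else None

    def stalling_agent(name: str, step: int) -> Optional[str]:
        # Keeps investigating and never submits a patch.
        return None

    print(run_incident(toy, committing_agent))
    print(run_incident(toy, stalling_agent))
```

In this sketch, an agent that never commits to a sub-conclusion is distinguished from one that submits a wrong patch, which mirrors the abstract's distinction between non-terminations and other failures.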
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 122