Compositional Investigation: Why Reasoning Enables Tool-Using Agents to Fix What They Diagnose

Published: 27 May 2026, Last Modified: 27 May 2026, CompLearn 2026 Poster, CC BY 4.0
Keywords: compositional learning, tool-using agents, root cause analysis, chain-of-thought reasoning, benchmark, incident remediation
TL;DR: On a closed-loop microservice bug-fix benchmark, reasoning models pass 5/5 incidents while non-reasoning models pass 1/5. Chain-of-thought acts as a compositional controller, not just an accuracy boost: the same model with reasoning on versus off shows a 5x difference in pass rate.
Abstract: We introduce rca llms, a closed-loop benchmark in which an agent must diagnose and patch microservice bugs by composing four sub-skills (evidence gathering, cross-service correlation, code localization, and code editing), with correctness verified by a deterministic reproducer rather than an LLM judge. Across six frontier models and five multi-hop incidents, all three reasoning variants achieve 5/5 pass rates while all three non-reasoning variants score 1/5, with most failures being non-terminations. The same base model (grok-4-1-fast) scores 5/5 with reasoning and 1/5 without, isolating chain-of-thought as a compositional controller that commits to sub-conclusions and advances the pipeline, rather than a simple accuracy booster.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 122