The MCP Quality Paradox: Enhancing Tool Effectiveness in Long-Horizon Reasoning through Phase-Aware State Abstraction

Published: 13 Dec 2025, Last Modified: 16 Jan 2026
Venue: AILaw26
License: CC BY-NC-SA 4.0
Keywords: MCP tool quality, long-horizon reasoning, information heterogeneity
Paper Type: Short papers / work-in-progress
TL;DR: Quality-aware tool selection via phase-aware state abstraction for long-horizon reasoning
Abstract: Large Language Model (LLM) agents are increasingly tasked with long-horizon deep research, drawing on an expanding ecosystem of heterogeneous tools via the Model Context Protocol (MCP). Integrating diverse information sources, ranging from structured databases (SQL) to unstructured retrieval systems (Web search, ArXiv), enhances capability in theory but introduces significant integration challenges. Specifically, the unweighted fusion of high-fidelity internal data with noisy external retrieval can compromise reasoning consistency, while the linear accumulation of intermediate tool outputs leads to context suffocation, where critical signals are diluted by redundant interaction history. To address these challenges, we introduce Q-STEAM (Quality-aware State Abstraction for Multi-Hop Reasoning), a framework that reformulates long-horizon reasoning as a Phase-Aware Decision Process. Unlike mono-contextual paradigms, Q-STEAM does two things. First, it dynamically evaluates tool reliability within each reasoning phase, namely Acquisition (retrieval quality), Analysis (extraction accuracy), and Propagation (aggregation robustness), rather than applying uniform tool credibility scores. Second, it implements Phase-Aware State Abstraction, which synthesizes reasoning history into evolved reports at phase boundaries, selectively discarding redundancy while preserving reasoning continuity. We validate Q-STEAM on HotpotQA (controlled multi-hop reasoning) and a novel Legal Case Synthesis dataset (high-stakes, real-world uncertainty). Experimental results show that Q-STEAM improves over baselines.
Submission Number: 56
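
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Tool`, `AgentState`, `select_tool`, and `abstract_state` are hypothetical, the reliability values are made-up placeholders, and the order-preserving deduplication merely stands in for whatever synthesis step Q-STEAM actually uses at a phase boundary.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # The three reasoning phases named in the abstract.
    ACQUISITION = "acquisition"
    ANALYSIS = "analysis"
    PROPAGATION = "propagation"


@dataclass
class Tool:
    name: str
    # Hypothetical per-phase reliability scores in [0, 1], keyed by Phase.
    reliability: dict


@dataclass
class AgentState:
    report: str = ""                              # condensed report carried across phases
    history: list = field(default_factory=list)   # raw tool outputs of the current phase


def select_tool(tools, phase):
    """Pick the tool with the highest reliability for the given phase."""
    return max(tools, key=lambda t: t.reliability.get(phase, 0.0))


def abstract_state(state, phase):
    """Fold the current phase's history into the report at a phase boundary.

    Assumption: redundancy removal is modelled as order-preserving
    deduplication; a real system would likely use a learned summarizer.
    """
    unique = list(dict.fromkeys(state.history))
    section = f"[{phase.value}] " + "; ".join(unique)
    report = f"{state.report}\n{section}" if state.report else section
    return AgentState(report=report, history=[])


# Illustrative tools with placeholder scores (not values from the paper).
tools = [
    Tool("sql", {Phase.ACQUISITION: 0.9, Phase.ANALYSIS: 0.6}),
    Tool("web_search", {Phase.ACQUISITION: 0.5, Phase.ANALYSIS: 0.7}),
]

state = AgentState(history=["fact A", "fact B", "fact A"])
state = abstract_state(state, Phase.ACQUISITION)
print(select_tool(tools, Phase.ACQUISITION).name)
print(state.report)
```

The point of the sketch is the contrast with a single global credibility score: the same tool can rank first in one phase and second in another, and the raw history is replaced by a shorter report each time a phase ends.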