% Replace placeholder author metadata before submission.
\documentclass{ceurart}

% \sloppy

\usepackage{enumitem}
%\usepackage[cal=cm]{mathalfa}
\usepackage{subcaption}
\usepackage[cal=cm]{mathalfa}
\setlist[itemize]{itemsep=0.25em, topsep=0.35em, parsep=0pt, partopsep=0pt}
\setlist[enumerate]{itemsep=0.5em, topsep=0.35em, parsep=0pt, partopsep=0pt}
\setlength{\belowcaptionskip}{-10pt}

\graphicspath{{../images/}}

\title{Representing Agentic Tools in Knowledge Graphs for Structure-Aware Tool Discovery Under Tool Overload}

\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

\conference{GENAIK-NORA 2026: Joint Workshop on Generative AI and Knowledge Graphs and Knowledge Graphs \& Agentic Systems Interplay, IJCAI-ECAI 2026 Workshops, August 2026, Bremen, Germany}



\begin{document}

\author[1]{Isaiah Onando Mulang'}[%
  email=mualang.onando@sap.com,
]
\author[1]{Johannes Thaller}[%
  email=johannes.thaller@sap.com,
]
\author[1]{Tushar Trivedi}[%
  email=tushar.trivedi@sap.com,
]
\author[1]{Lars Heling}[%
  email=lars.heling@sap.com,
]
\author[1]{Felix Sasaki}[%
  email=felix.sasaki@sap.com,
]
\address[1]{SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany}


\begin{abstract}
Large language model (LLM) agents increasingly rely on external tools, yet most tool ecosystems still expose those tools as unstructured textual descriptions or JSON schemas. As tool inventories grow, this becomes a retrieval problem where the agent must surface a small relevant set under context and tool-budget constraints. We study knowledge-graph-based tool representation for agentic systems through a lightweight ontology for Model Context Protocol (MCP) tools. The ontology models tools, servers, capabilities, and parameters, and treats required versus optional inputs as first-class relations. Using real MCP tool schemas extracted from publicly available servers, we build an RDF knowledge graph. We instantiate this Knowledge Graph on MCP-Atlas, a benchmark for tool-use competency built around real MCP servers, and compare a KG-augmented discovery workflow against a text-only baseline across multiple frontier models and two exposure regimes: the benchmark's smaller task-level tool menus and an overload setting with an all-tools registry of approximately 269 tools over 258 executed tasks.
Our early empirical results yield specific and actionable insights. In smaller curated tool settings, direct text-only exposure remains stronger for all tested models. However, under overload, where the unstructured baseline is constrained by a maximum tool budget, KG-based filtering improves GPT-5 from 0.478 to 0.542 mean coverage. For Claude 4.6 Sonnet in the all-tools condition, the KG retains roughly 89\% of the text baseline's coverage while reducing the candidate set from about 270 tools to 4.6 tools on average. Qualitative error analysis indicates that the KG helps primarily by reducing tool overload, name ambiguity, and backend confusion, while its main weakness is incomplete recall caused by missing or imperfect capability assignments. We conclude that knowledge graphs are valuable as a structure-aware compression layer for large, noisy tool registries, and that how best to combine tool knowledge graphs with strong textual tool descriptions remains an open research question.

\end{abstract}

\begin{keywords}
knowledge graphs \sep
tool discovery \sep
MCP \sep
agentic AI \sep
ontology engineering \sep
tool use \sep
LLM agents
\end{keywords}

\maketitle

\section{Introduction}

Tool use has become a core capability of modern LLM agents~\cite{wolflein-etal-2025-llm,li-2025-review,xu2026evolutiontoolusellm}. Instead of answering from parametric memory alone, agents are designed to search documents, query databases, call APIs, read files, or compose multiple tools into a task-specific workflow that meets a target goal. An emerging research direction for tool use in agentic settings is multi-tool orchestration over long trajectories~\cite{xu2026evolutiontoolusellm}, which in turn demands efficient discovery of the tools in use. In practice, however, tool interfaces are still mostly exposed as flat lists of names, descriptions, and JSON schemas. This is workable at small scale, but it becomes brittle once agents face tens or hundreds of overlapping, potentially ambiguous tools across multiple servers and domains. Meanwhile, Knowledge Graphs~\cite{10.1007/s10462-023-10465-9,10.5555/1785162.1785216} have been established as an authoritative way to structure enterprise information and to make it consumable by generative models, either through graph-retrieval-augmented generation (Graph-RAG)~\cite{10.5555/3495724.3496517,10.1145/3777378} or as encoded graph tokens~\cite{fatemi2024talk,Perozzi2024LetYG,hu-etal-2024-lets} injected into large language models. Efficient graph representation has likewise been studied through a range of tasks over the past decades.

The overarching research question in this work is whether a Knowledge Graph (KG) can serve as a better representation layer for tool discovery. The central idea is that tool selection is not only a lexical matching problem but also a structural reasoning problem involving capability hierarchies, server provenance, parameter semantics, and constraints on how tools can be used. A Knowledge Graph makes these structures explicit, supports transparent traversal and validation, and reduces the number of tools that must be surfaced to the downstream agent.

Although there are numerous ways to represent and serve tools for agentic use, we focus on the Model Context Protocol (MCP)~\cite{anthropicIntroducingModel,MCP-Landscape} because it has emerged as a practical interoperability layer for agentic tools and because MCP-Atlas provides a community-relevant benchmark built on real servers, controlled distractor tools, multi-step workflows, and claims-based evaluation~\cite{mcp_atlas}. Our ontology centers on four core classes: \emph{Server}, \emph{Tool}, \emph{Capability}, and \emph{Parameter}. We map MCP schemas into RDF triples, validate the graph with SPARQL, and use it as a discovery layer before execution. Our evaluation supports a more precise claim than ``KGs outperform text.'' The current KG helps mainly in overload regimes, where the model must choose under too many tools, semantically overlapping descriptions, or provider-specific tool limits. In smaller curated settings, unstructured text remains better because it preserves recall and avoids an additional routing stage. This distinction matters for the GenAIK-NORA audience: the contribution is not only an ontology, but also an empirical characterization of when symbolic structure helps agentic tool use and when it does not. This paper makes four contributions:

\begin{enumerate}[leftmargin=1.5em]
\item A lightweight ontology for MCP-style tools that models servers, tools, capabilities, and parameters, while explicitly distinguishing required from optional inputs.
\item A KG construction workflow from real MCP tool schemas to RDF, with validation queries for graph integrity and relation traversal.
\item A comparative evaluation of KG-augmented tool discovery versus text-only tool exposure across multiple models and tool-budget regimes.
\item A fine-grained analysis showing that the KG's main benefit is structure-aware filtering under tool overload, whereas its main failure mode is recall loss from incomplete capability coverage.
\end{enumerate}

\section{Background and Related Work}

\subsection{Tool Schemas and Tool-Use Evaluation}

Recent agent ecosystems have converged on schema-based tool descriptions, typically centered on JSON-style parameter specifications. Practitioner guidance from Anthropic emphasizes that effective tools need clear names, high-signal descriptions, token-efficient responses, and explicit evaluation, because agents are easily confused by overlapping or overly generic tool interfaces \cite{anthropic_tools}. Likewise, tool-schema engineering guides describe JSON Schema as the de facto foundation for specifying name, description, parameter types, required fields, defaults, and constraints \cite{oneuptime_schemas}. These sources matter here because our ontology is intentionally grounded in the fields that appear consistently across real tool schemas.

Benchmarking work has moved from isolated function-calling tasks toward more realistic agentic evaluation. Berkeley Function Calling Leaderboard (BFCL) evaluates tool and function calling performance across diverse scenarios and has evolved from function-call accuracy toward broader agentic evaluation \cite{bfcl}. ToolSandbox argues that realistic benchmarking requires stateful execution, implicit dependencies between tools, and conversational interaction \cite{toolsandbox}. Tool Playgrounds and StableToolBench likewise highlight the need for large-scale, analyzable, and stable evaluation environments for tool-using agents \cite{toolplaygrounds,stabletoolbench}. Most directly relevant to this paper, MCP-Atlas evaluates tool-use competency with real MCP servers, multi-step workflows, controlled tool exposure, distractors, and claims-based scoring \cite{mcp_atlas}. The public release contains 500 tasks, while the public leaderboard evaluates 1,000 tasks across 36 servers. Our evaluation sits within this broader trend, but asks a different question: whether structured tool representation changes which tools are discovered and therefore which tasks are solvable.

\subsection{Semantic Service Discovery and Composition}

The semantic-web community studied automated service discovery long before LLM agents. OWL-S and related work argued that machine-interpretable service descriptions are required for automatic matching, composition, and invocation of services \cite{owls,owls_ranked}. Other work extended semantic matching with non-functional criteria such as quality-of-service and later adapted ontology-driven discovery to cloud services \cite{semantic_qos,cloud_ontology}. These works are clear predecessors to tool discovery for LLM agents. However, classical semantic service discovery targeted web services and enterprise integration, not the interactive, prompt-driven, context-limited behavior of modern LLM agents. Our work revisits the same semantic discovery problem, but under a new operating constraint: agents must reason over tool interfaces using limited context windows and imperfect natural-language plans.

\subsection{Knowledge Graphs for Agent and Tool Retrieval}

Recent work has begun to combine graph-based retrieval with MCP-style agent ecosystems. Agent-as-a-Graph represents tools and parent agents as nodes in a knowledge graph and reports improvements in Recall@5 and nDCG@5 on a live MCP benchmark \cite{agent_as_a_graph}. That work is closely aligned with the present paper, but focuses on graph-based retrieval for multi-agent systems rather than ontology design for tool schema semantics or a controlled comparison against direct text exposure. More broadly, graph-based reasoning has been used in domain-specific agent systems such as SciAgents, where ontological graphs help organize concepts and support multi-agent reasoning \cite{sciagents}. Our work differs in scope: it targets the representation and retrieval of executable tools themselves.

\section{Problem Statement and Formalization}

Let $\mathcal{S}$ be a set of servers, $\mathcal{T}$ a set of tools, $\mathcal{C}$ a set of capabilities, and $\mathcal{P}$ a set of parameters. We model the tool ecosystem as a typed directed graph:
\vspace{-5 pt}
\[
\mathcal{G} = (\mathcal{V}, \mathcal{E}, \tau_V, \tau_E),
\]

where
\vspace{-7 pt}
\[
\mathcal{V} = \mathcal{S} \cup \mathcal{T} \cup \mathcal{C} \cup \mathcal{P}
\]
\vspace{-7 pt}
and $\tau_V$ and $\tau_E$ assign node and edge types. The graph contains at least the following edge families:
% \vspace{-3 pt}
\begin{align*}
E_{\text{host}} &\subseteq \mathcal{T} \times \mathcal{S} && \text{(\emph{hostedOn})}, \\
E_{\text{cap}} &\subseteq \mathcal{T} \times \mathcal{C} && \text{(\emph{hasCapability})}, \\
E_{\text{req}} &\subseteq \mathcal{T} \times \mathcal{P} && \text{(\emph{hasRequiredInput})}, \\
E_{\text{opt}} &\subseteq \mathcal{T} \times \mathcal{P} && \text{(\emph{hasInput})}, \\
E_{\text{sub}} &\subseteq \mathcal{C} \times \mathcal{C} && \text{(capability-parent links)}.
\end{align*}
% \vspace{-3pt}
Given a task instance $x \in \mathcal{X}$ with natural-language request $q(x)$, a discovery policy $D$ returns a candidate set of tools
\vspace{-5pt}
\[
R_D(x) \subseteq \mathcal{T}, \quad |R_D(x)| \leq B,
\]

where $B$ is the tool budget that can be passed to the execution agent. An execution agent $A$ then uses only the candidate set $R_D(x)$ to produce an answer $\hat{y}(x)$ and possibly a tool-use trace $\pi(x)$. 
Let $y^*(x)$ denote the reference answer and let
\vspace{-5 pt}
\[
\mathrm{Cov}(\hat{y}(x), y^*(x)) \in [0,1]
\]

be the benchmark coverage score used by the evaluation harness. The system objective is to maximize expected task coverage:
\vspace{-10 pt}
\[
J(D, A) = \mathbb{E}_{x \sim \mathcal{X}}\left[\mathrm{Cov}(\hat{y}(x), y^*(x))\right].
\]
\vspace{-10 pt}

This setup makes the core trade-off explicit. A text-only strategy $D_{\text{text}}$ can have high recall because it exposes many or all tools directly, but it risks overload when $B$ is large or when the model must reason over many overlapping descriptions. A KG-based strategy $D_{\text{kg}}$ can reduce the candidate set by exploiting structure in $\mathcal{G}$, but it may fail if the graph omits relevant capabilities or routes the query to the wrong subgraph. To analyze discovery quality more directly, let $T^*(x) \subseteq \mathcal{T}$ denote the oracle set of tools sufficient for solving task $x$. Then discovery precision and recall are
\vspace{-5 pt}
\[
\mathrm{Prec}_D(x) = \frac{|R_D(x) \cap T^*(x)|}{|R_D(x)|}, \qquad
\mathrm{Rec}_D(x) = \frac{|R_D(x) \cap T^*(x)|}{|T^*(x)|}.
\]

The evidence in our experiments indicates that the current KG achieves very high precision but insufficient recall. In other words, the KG is already good at filtering out irrelevant tools, but not yet good enough at recovering all relevant tools for hard multi-step tasks.
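
As a hypothetical illustration of these definitions (the numbers are assumed for exposition and are not taken from our runs), suppose a task has an oracle set of four tools, $|T^*(x)| = 4$, and the discovery policy returns a single tool that belongs to the oracle set, so that $|R_D(x)| = 1$ and $|R_D(x) \cap T^*(x)| = 1$. Then
\[
\mathrm{Prec}_D(x) = \frac{1}{1} = 1, \qquad \mathrm{Rec}_D(x) = \frac{1}{4} = 0.25 .
\]
The candidate set is perfectly clean, yet three of the four tools needed for the task never reach the execution agent. This is the shape of failure that a high-precision, low-recall discovery layer produces.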

\section{Ontology and Knowledge Graph Construction}

\subsection{Ontology Scope}

We intentionally scope the ontology to the entities that matter most for discovery on MCP-style benchmarks. The current graph models four entity types:

\begin{itemize}[leftmargin=1.5em]
\item \textbf{Server}: the MCP server that hosts one or more tools.
\item \textbf{Tool}: the main executable entity, described by a name, label, and natural-language description.
\item \textbf{Capability}: an abstract functional class used for semantic grouping and retrieval.
\item \textbf{Parameter}: a reusable node describing an input argument, with its type and default value when available.
\end{itemize}

\vspace{0.3pt}
\begin{figure}[t]
\centering
\includegraphics[width=0.6\linewidth]{images/data-model.png}
\caption{The Data Model: Implemented core ontology used in the KG prototype.}
\label{fig:class-diagram}
\end{figure}
Figure~\ref{fig:class-diagram} shows the implemented core class diagram. One noteworthy design choice is that requirement status is modeled relationally rather than as a Boolean attribute on the parameter alone. A parameter may be required for one tool and optional for another, so the distinction belongs naturally to the edge type: \emph{hasRequiredInput} versus \emph{hasInput}. This keeps the ontology closer to the semantics of invocation. The graph is tool-centric: hosting is modeled as \emph{Tool} $\rightarrow$ \emph{hostedOn} $\rightarrow$ \emph{Server}, because the tool is the primary retrieval object and this direction eases extension to future non-MCP tool types such as REST APIs or local tools. We also keep a reverse relation from server to tool. In terms of scope, tool outputs are excluded from the current core ontology: whereas MCP input schemas are standardized, output schemas are often runtime-dependent or absent. Output modeling is therefore deferred to future work.

\subsection{Capability Taxonomy}

Capabilities are the semantic bridge between raw tool schemas and higher-level tool discovery. Tool selection is driven primarily by what a tool can accomplish, not by the surface form of its name. We therefore organize capabilities into a lightweight hierarchy that reflects recurring functional categories across contemporary tool ecosystems and benchmarks. The current taxonomy includes top-level clusters such as information access, content generation, data processing, and system interaction, with leaf capabilities including web search, database querying, external API access, text generation, code generation, image generation, information transformation, computation, file management, and shell execution. Figure~\ref{fig:capability-taxonomy} shows the capability taxonomy used for this clustering. This taxonomy serves two roles. First, it supports retrieval by grouping tools that are semantically related even when their names differ. Second, it provides an abstraction layer that can eventually align MCP-native tools with alternative representations such as OpenAPI or other agent-tool registries.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\linewidth]{images/taxonomy.pdf}
\caption{Capability taxonomy used to organize tool functionality into retrieval-relevant semantic classes.}
\label{fig:capability-taxonomy}
\end{figure}

\subsection{ETL Pipeline and Validation}

The KG is built from real MCP tool schemas in the MCP-Atlas environment \cite{mcp_atlas}. We initially validated the mapping on representative servers such as Wikipedia, whose tools include search, retrieval, summarization, and section-level operations. The same mapping procedure is then applied across benchmark servers and checked with SPARQL queries for relation traversal, inverse consistency, hierarchy traversal, and basic data integrity. The implemented mapping strategy is intentionally lightweight:

\begin{enumerate}[leftmargin=1.5em]
\item Extract raw MCP tool schemas from server manifests or JSON outputs.
\item Normalize tool names, descriptions, server metadata, and JSON-schema parameter fields.
\item Create parameter nodes with type and default metadata.
\item Attach parameters using either \emph{hasRequiredInput} or \emph{hasInput}.
\item Map each tool to one or more capabilities from the taxonomy.
\item Serialize the resulting graph to RDF and validate it using SPARQL.
\end{enumerate}
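
As a minimal sketch of what steps 3 to 5 produce, consider a hypothetical search tool $t$ hosted on the Wikipedia server $s$, with one required string parameter $p_1$ and one optional integer parameter $p_2$. The tool, its parameters, and its capability assignment are assumed for illustration and are not copied from the benchmark schemas. In the notation of the problem formalization, the mapping would add the edges
\[
(t, s) \in E_{\text{host}}, \quad (t, c_{\text{search}}) \in E_{\text{cap}}, \quad (t, p_1) \in E_{\text{req}}, \quad (t, p_2) \in E_{\text{opt}},
\]
where $c_{\text{search}}$ stands for whichever search-related leaf capability of the taxonomy the tool is assigned to. Because requirement status lives on the edge, the same parameter node $p_1$ could appear in $E_{\text{opt}}$ for a different tool without any change to the node itself.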

In the current system, only tools with assigned capabilities are included in the KG. This choice improves precision, because unsupported tools are excluded, but it also creates a recall bottleneck whenever relevant tools have not yet been assigned a capability. Figure~\ref{fig:instance-graph} shows an instance-level Mermaid rendering used to validate mappings for representative tools from the Wikipedia server.
\vspace{-5 pt}
\begin{figure}[t]
\centering
\includegraphics[width=0.75\linewidth]{images/wikipedia_instance.pdf}
\caption{Instance-level validation of the ontology using Wikipedia MCP tools. The example highlights server, tool, capability, and parameter mappings.}
\label{fig:instance-graph}
\end{figure}

\section{KG-Augmented Tool Discovery Workflow}

The KG-augmented workflow separates \emph{discovery} from \emph{execution}. Instead of presenting the downstream agent with all tool descriptions up front, the system first queries the graph to retrieve a small candidate subset based on task semantics, capability associations, and structural metadata. The execution agent then operates over that reduced tool set. At a high level, the workflow is:

\begin{enumerate}[leftmargin=1.5em]
\item Parse the task request and identify likely capabilities, data sources, or action types.
\item Traverse the graph to retrieve tools linked to those capabilities and their hosting servers.
\item Inspect tool descriptions and parameter schemas for a small number of candidates.
\item Return the final candidate set to the execution agent.
\item Execute one or more MCP tools and produce the final answer.
\end{enumerate}

This is qualitatively different from a text-only baseline that loads all available tools or a selected subset of them directly into the agent context. The KG workflow performs explicit candidate compression before the model commits to execution. In the all-tools regime, this reduces the average candidate set from roughly 270 tools to approximately 4.6 tools in the Claude 4.6 Sonnet setting. Such compression is valuable because practical tool use is constrained not only by retrieval quality but also by context limits and provider-specific caps on the number of tools that can be supplied at once.
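
One way to make steps 1 to 4 concrete, stated here as an assumed formalization rather than a description of the implemented queries, is the following. Let $\hat{C}(x) \subseteq \mathcal{C}$ be the capabilities identified for the request $q(x)$ in step 1, and let $\hat{C}^{\downarrow}(x)$ contain $\hat{C}(x)$ together with every capability that reaches a member of $\hat{C}(x)$ by following capability-parent links in $E_{\text{sub}}$. The graph traversal in step 2 then yields
\[
T_{\text{cand}}(x) = \{\, t \in \mathcal{T} : \exists\, c \in \hat{C}^{\downarrow}(x) \text{ with } (t, c) \in E_{\text{cap}} \,\},
\]
and steps 3 and 4 select a final set $R_{D_{\text{kg}}}(x) \subseteq T_{\text{cand}}(x)$ with $|R_{D_{\text{kg}}}(x)| \leq B$. The symbols $\hat{C}(x)$, $\hat{C}^{\downarrow}(x)$, and $T_{\text{cand}}(x)$ are introduced only for this sketch. Under this reading, a tool without any capability edge can never enter $T_{\text{cand}}(x)$, whatever the request.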

\section{Experimental Setup}

\subsection{Benchmark and Protocol}

We evaluate on MCP-Atlas, a large-scale benchmark for tool-use competency with real MCP servers \cite{mcp_atlas}. MCP-Atlas is designed for realistic agentic workflows rather than isolated function calls: tasks are written in natural language, typically require 3--6 tool calls, avoid naming the target tool directly, and are scored with a claims-based coverage metric. Each benchmark instance provides a prompt, a task-specific enabled tool menu, ground-truth claims, and a tool-call trajectory for diagnosis. The public release contains 500 tasks, while the leaderboard evaluates 1,000 tasks across 36 servers spanning search, analytics, productivity, finance, and coding. The reported runs use the MCP-Atlas environment and scoring methodology. Under the active server configuration for our experiments, the executable benchmark slice covers 258 tasks. In the overload regime, the accessible registry expands to approximately 269 tools, which is large enough to expose the retrieval bottleneck that this paper targets.

\subsection{Compared Conditions}

We compare two retrieval settings. \textbf{\textsc{Selected-tools setting:}} This setting is closest to the standard MCP-Atlas protocol. The execution agent receives a smaller task-level tool menu rather than the full registry. It tests whether KG mediation still helps when tool overload is already controlled.\\
\textbf{\textsc{All-tools setting:}} This is an additional stress test beyond the default benchmark configuration. The KG agent discovers tools from the full graph, while the text baseline is exposed to the full executable registry directly, subject to provider-specific limits. This regime stresses context saturation and retrieval ambiguity. GPT-5 is especially informative because the text baseline was capped at 128 tools, while the KG pipeline could still search over the full tool pool.
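
To make the effect of the GPT-5 cap concrete, the arithmetic below uses the approximate registry size of 269 tools reported for the overload regime and the cap of 128 tools. It is a back-of-the-envelope bound, not a measured quantity:
\[
\frac{B}{|\mathcal{T}|} \approx \frac{128}{269} \approx 0.476, \qquad |\mathcal{T}| - B \approx 269 - 128 = 141 .
\]
Under these figures, the capped text baseline can expose at most about 48\% of the registry for any single task, and roughly 141 tools remain invisible to it regardless of the request.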

\subsection{Models and Metric}

The reported runs cover Claude 4.6 Sonnet, Claude 4.6 Opus, GPT-5, and Gemini 2.5 Pro. We report mean coverage on a $[0,1]$ scale, because it is more sensitive than pass rate to partial task completion and near misses. MCP-Atlas defines a task as passed when coverage is at least 0.75, which is appropriate for leaderboard ranking; mean coverage is the more informative measure here because the intervention affects discovery quality before it changes binary task success. We also analyze task outcomes, tool-use traces, and candidate-set sizes in order to explain not only whether the KG helps, but why.
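
The difference between the two measures can be written out explicitly, in notation introduced here for illustration. With $N$ evaluated tasks $x_1, \dots, x_N$ and per-task coverage $\mathrm{Cov}_i = \mathrm{Cov}(\hat{y}(x_i), y^*(x_i))$, the quantities are
\[
\overline{\mathrm{Cov}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Cov}_i, \qquad
\mathrm{PassRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\mathrm{Cov}_i \geq 0.75\right],
\]
where $\mathbf{1}[\cdot]$ is the indicator function. As a hypothetical example, a task whose coverage rises from 0.40 to 0.70 after a change in the discovered tool set contributes $+0.30/N$ to mean coverage but leaves the pass rate unchanged, because both values fall below the 0.75 threshold.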


\begin{table}[t]
\centering
\caption{Mean coverage scores for KG-based discovery versus text-only tool exposure. Positive $\Delta$ means the KG is better.}
\label{tab:main-results}
\small
\begin{tabular}{p{0.23\linewidth}p{0.22\linewidth}r r r}
\toprule
Model & Setting & Text & KG & $\Delta$ \\
\midrule
Claude 4.6 Sonnet & Selected tools & 0.778 & 0.713 & $-0.065$ \\
GPT-5 & Selected tools & 0.658 & 0.584 & $-0.074$ \\
Gemini 2.5 Pro & Selected tools & 0.201 & 0.088 & $-0.113$ \\
Claude 4.6 Opus & Selected tools & 0.871 & 0.814 & $-0.057$ \\
\midrule
Claude 4.6 Sonnet & All tools & 0.645 & 0.575 & $-0.070$ \\
GPT-5 & All tools & 0.478 & 0.542 & $+0.064$ \\
Gemini 2.5 Pro & All tools & 0.152 & 0.102 & $-0.050$ \\
Claude 4.6 Opus & All tools & 0.764 & 0.653 & $-0.111$ \\
\bottomrule
\end{tabular}
\end{table}

% \subsection{Overall Comparison}
\section{Results}
Table~\ref{tab:main-results} reports the mean coverage scores from our MCP-Atlas runs, and Figure~\ref{fig:coverage-histograms-side-by-side} shows the distribution of coverage for two models: (a) Claude 4.6 Sonnet and (b) GPT-5. The key observation is that the KG is \emph{not} uniformly superior. First, in the selected-tools setting, the text baseline outperforms the KG for every tested model, suggesting that when the tool space has already been narrowed sufficiently, direct text descriptions remain easier for frontier models to exploit than an additional discovery layer. Second, under tool overload in the all-tools setting, GPT-5 improves under KG-based filtering, moving from 0.478 to 0.542 mean coverage. This indicates that a structured retrieval layer can recover useful tools more effectively than direct flat exposure in overloaded contexts, at least when the flat baseline is subject to a provider tool cap (128 tools for GPT-5). Finally, raw model capability remains a dominant factor. Claude 4.6 Opus is the strongest model in both settings, while Gemini 2.5 Pro performs poorly regardless, meaning that the KG is not a standalone substitute for strong tool reasoning.
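
For reference, the $\Delta$ column in Table~\ref{tab:main-results} is the KG score minus the text score. For GPT-5 in the all-tools setting, the absolute and relative changes work out to
\[
\Delta = 0.542 - 0.478 = 0.064, \qquad \frac{0.064}{0.478} \approx 0.134,
\]
that is, a gain of 6.4 coverage points, or roughly 13\% relative to the text baseline. The relative figure is derived here from the table entries and is not a separately measured quantity.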

The Claude 4.6 Sonnet all-tools condition is especially informative because task-level analysis is available for that run. In that setup, both approaches had access to approximately 270 tools, but the KG reduced the discovered set to about 4.6 tools on average. Despite this roughly 98\% reduction in candidate tools, the KG still achieved 0.575 mean coverage versus 0.645 for the text baseline, or about 89\% of the baseline's performance. This trade-off shows that structured filtering can preserve most of the baseline's effectiveness even while dramatically compressing the search space. For systems deployed with tight tool budgets, latency constraints, or a need for interpretable retrieval, that may be a worthwhile exchange even before the KG surpasses the text baseline outright.
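
The two percentages quoted above follow directly from the reported averages, using the approximate registry size of 270 tools:
\[
1 - \frac{4.6}{270} \approx 0.983, \qquad \frac{0.575}{0.645} \approx 0.891 .
\]
The first value is the fractional reduction in candidate tools, and the second is the share of the text baseline's mean coverage retained by the KG.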

\begin{figure}[t]
\centering

\begin{subfigure}[t]{0.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{images/coverage_histogram_comparison_claude46sonnet.png}
    \caption{Coverage histogram comparison for Claude 4.6 Sonnet.}
    \label{fig:coverage-claude-sonnet}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{images/coverage_histogram_comparison_gpt5_all_tools.png}
    \caption{Coverage histogram comparison for GPT-5 in the all-tools setting.}
    \label{fig:coverage-gpt5-all-tools}
\end{subfigure}

\caption{Task-level coverage distributions for the two most discussion-relevant experimental settings. Panel (a) shows Claude 4.6 Sonnet, while panel (b) shows GPT-5 under all-tools exposure.}
\label{fig:coverage-histograms-side-by-side}
\end{figure}


\subsection{Why and When the KG Helps}

Task-level inspection points to the following main scenarios in which the KG helps.
\vspace{-10 pt}
\paragraph{Name and namespace ambiguity.} When many tools overlap lexically, the text baseline can be misled by superficial name similarity or naming conventions. Capability- and server-aware retrieval helps by grouping tools semantically rather than relying only on surface form.
\vspace{-10 pt}
\paragraph{Backend overload and wrong-source confusion.} In large tool pools, the text baseline sometimes chooses the wrong backend, for example using MongoDB where Airtable or Notion would be more appropriate. The KG helps by steering retrieval toward a capability-consistent region of the tool space.
\vspace{-10 pt}

\subsection{Why the KG Still Loses in Many Cases}

The same error analysis also identifies why the KG underperforms in the other conditions.
\vspace{-10 pt}
\paragraph{The recall bottleneck.} Discovered tools are almost always correct, with precision around 98\% to 99.9\% in the selected-tools setting, but recall is only around 25\%. This means the KG often returns a clean but incomplete tool set.
\vspace{-10 pt}
\paragraph{Capability assignment errors steer the graph incorrectly.} The most frequent issue in the Claude 4.6 Sonnet all-tools analysis was wrong data-source routing, especially when the graph steered tasks toward MongoDB or Airtable even though the task depended on local CSV files. This indicates a failure of semantic coverage and routing quality in the current graph.
\vspace{-10 pt}
\paragraph{Some losses are implementation bugs rather than conceptual limits.} Task traces expose fixable sources of failure such as a TwelveData parameter-wrapper mismatch, the wrong filesystem base path, and under-routing for Airtable. These issues likely understate the KG's eventual ceiling.

\begin{table}[t]
\centering
\caption{Task-level outcome breakdown for the Claude 4.6 Sonnet all-tools condition.}
\label{tab:error-analysis}
\small
\begin{tabular}{p{0.2\linewidth}p{0.72\linewidth}}
\toprule
Outcome & Main observations \\
\midrule
KG wins (56 tasks, 22\%) & Better backend selection under overload, fewer naming confusions, fewer failures from web-search blocking and security-rule issues. \\
Tie (119 tasks, 46\%) & Mostly simpler single-tool tasks such as weather, museum search, or direct file reads where both approaches succeed. \\
Text wins (83 tasks, 32\%) & Wrong data-source steering, asking the user for file paths instead of discovering them, upstream HTTP 500 errors, and schema mismatches. \\
\bottomrule
\end{tabular}
\end{table}

The divergent tasks are also qualitatively harder. They involve cross-domain tool chaining, file system exploration, computation, and multi-step planning. This matters because it suggests that the KG's missing recall is most costly on precisely the tasks where semantic abstraction should matter most.
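
The counts in Table~\ref{tab:error-analysis} can be checked against the size of the executed benchmark slice, and they quantify how many tasks are divergent:
\[
56 + 119 + 83 = 258, \qquad \frac{56 + 83}{258} = \frac{139}{258} \approx 0.54 .
\]
Roughly 54\% of the tasks therefore separate the two approaches. Among these divergent tasks, the text baseline wins $83/139 \approx 60\%$ and the KG wins $56/139 \approx 40\%$.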

\section{Discussion}
\vspace{-5 pt}
The results support a more nuanced interpretation of KG-based tool discovery than a comparison of average scores alone.
\vspace{-15 pt}
\paragraph{1.) The KG is best understood as a filtering mechanism, not yet a superior discovery oracle.} The one clear positive result appears exactly in the regime where the text baseline is overloaded. This is consistent with both the GPT-5 all-tools result and the Claude Sonnet observation that the KG can compress roughly 270 tools to 4.6 candidates while preserving most of the baseline's task coverage.
\vspace{-10 pt}
\paragraph{2.) The main structural benefit is semantic factorization.} A flat registry forces the agent to reason over tool names and descriptions directly. The KG introduces intermediate abstractions such as capabilities, parent-child relations, and server provenance. These abstractions reduce candidate entropy and make tool selection more interpretable. In effect, the KG transforms tool retrieval from ``search the entire registry'' into ``enter the right semantic neighborhood, then inspect a few local candidates.''
\vspace{-10 pt}
\paragraph{3.) High precision is not enough.} A graph can still lose badly if it filters out a necessary tool. The current system already seems able to avoid many irrelevant candidates, but because the graph includes only capability-mapped tools and some mappings are incomplete, it often narrows the search space too aggressively. This is why the KG underperforms in curated settings, where the text baseline already enjoys manageable context and can afford broader exploration.
\vspace{-10 pt}
\paragraph{4.) Model behavior and graph quality interact strongly.} In our runs, Claude Opus inspects the KG far more aggressively during discovery than Gemini, with roughly 25 times more detail-inspection calls in one comparison. This suggests that the KG is not a plug-in improvement that works identically for every model. A model must still know when to inspect, when to traverse, and when to stop filtering.
\\

In summary, these findings imply that KGs are most promising when tool registries are large, heterogeneous, and semantically overlapping; when providers impose a hard limit on the number of tools that can be exposed; or when organizations need interpretable retrieval and provenance. They are less compelling on tasks with a small curated tool set or when capability coverage is incomplete. 

\section{Conclusion and Future Work}

We presented a knowledge-graph-centered approach to representing and retrieving agentic tools, instantiated on MCP servers from MCP-Atlas and evaluated under both the benchmark's task-level tool menus and an additional overload condition derived from the same environment. The ontology models tools, servers, capabilities, and parameters, and prioritizes discovery over full execution semantics. The current model omits output schemas, preconditions, side effects, and long-horizon composition patterns. This omission matters for the hardest cross-domain tasks, where success depends on more than choosing the right entry point.
 
Empirically, our tool knowledge graph does not universally outperform text descriptions for tool discovery, and closing this gap remains an active research question for the community. We attribute this limitation to the currently limited graph coverage rather than to the KG formalization itself: missing or imperfect capability assignments reduce recall even when the retrieved tools are highly precise. However, we observe a more useful result: KG-based retrieval becomes valuable under tool overload, where flat textual exposure saturates the model's practical tool budget. This finding suggests that KG-based discovery can provide structure-aware compression, provenance, and controllable retrieval for large, noisy tool registries. Expanding and automatically maintaining capability coverage is therefore the main technical requirement for turning KG-based discovery from a useful filter into a consistently superior retrieval layer. Future work should also investigate how best to represent the tool KG for improved KG-based tool search, and evaluate the approach against stronger retrievers and larger live MCP registries.


\section*{Declaration on Generative AI}

During the preparation of this manuscript, the authors used a generative AI system to help organize the paper structure and generate initial draft text. After using this tool or service, the authors reviewed, verified, and edited the content as needed and take full responsibility for the publication's content.

\bibliography{sample-ceur}

\end{document}
