Retrieval Without Consensus: Quantifying Inter-LLM API Ranking Divergence in Multi-Agent Reasoning Systems
Keywords: LLM Agents, Tool Use, API Discovery, Web APIs, Reliability, Trustworthiness, Ranking Evaluation, Agentic Systems
TL;DR: We evaluated LLM agents' ability to discover APIs and found that their recommendations are often unreliable, but this "disagreement" itself is a strong signal for predicting and preventing failures.
Abstract: Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarking framework to quantify inter-LLM divergence—the extent to which models differ in API discovery and ranking under identical tasks. Across 15 canonical API domains and 5 major model families, we measure pairwise and group-level agreement using set-, rank-, and consensus-based metrics: Average Overlap, Jaccard, Rank-Biased Overlap, Kendall’s τ/W, and Cronbach’s α. Results show moderate overall alignment (AO ≈ 0.50, τ ≈ 0.45) but strong domain dependence: structured tasks (Weather, Speech-to-Text) are stable, while open-ended ones (Sentiment Analysis) diverge sharply. Volatility and consensus analyses reveal that coherence clusters around data-bound domains and degrades for abstract reasoning. These insights enable reliability-aware orchestration in multi-agent systems, where consensus weighting can improve coordination among heterogeneous LLMs.
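To make the pairwise agreement metrics named in the abstract concrete, below is a minimal sketch, not the authors' implementation, of how Jaccard, Average Overlap, and Kendall's τ could be computed over two ranked API lists; the API names and model outputs are illustrative placeholders, not results from the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Set overlap of two recommendation lists, ignoring order."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def average_overlap(a, b, depth=None):
    """Mean overlap of the top-k prefixes over increasing depth k."""
    depth = depth or max(len(a), len(b))
    overlaps = [len(set(a[:k]) & set(b[:k])) / k for k in range(1, depth + 1)]
    return sum(overlaps) / depth

def kendall_tau(a, b):
    """Kendall's tau computed over the items common to both rankings."""
    common = [x for x in a if x in b]
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # A pair is concordant if both rankings order x and y the same way.
        same_order = (a.index(x) < a.index(y)) == (b.index(x) < b.index(y))
        concordant += same_order
        discordant += not same_order
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0

# Hypothetical weather-API rankings from two different LLMs (illustrative only).
model_a = ["OpenWeatherMap", "WeatherAPI", "Tomorrow.io", "Visual Crossing"]
model_b = ["WeatherAPI", "OpenWeatherMap", "Open-Meteo", "Tomorrow.io"]

print(f"Jaccard         : {jaccard(model_a, model_b):.2f}")
print(f"Average Overlap : {average_overlap(model_a, model_b):.2f}")
print(f"Kendall's tau   : {kendall_tau(model_a, model_b):.2f}")
```

Group-level measures such as Kendall's W, Rank-Biased Overlap, and Cronbach's α would extend the same idea from pairs of rankings to the full set of models per domain.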
Submission Number: 30