Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity

Authors: ACL ARR 2026 January Submission135 Authors

Published: 22 Dec 2025 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: LLM Agents, Tool Use, Function Calling, Realistic Benchmark, API Complexity, Robustness Evaluation
Abstract: We introduce ***WildAGTEval***, a benchmark designed to evaluate large language model (LLM) agents’ function-calling capabilities under **realistic API complexity**. Unlike prior work that assumes an **idealized** API system and disregards real-world factors such as noisy API outputs, ***WildAGTEval*** accounts for two dimensions of real-world complexity: ❶ **API specification**, which includes detailed documentation and usage constraints, and ❷ **API execution**, which captures runtime challenges. Consequently, ***WildAGTEval*** offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using ***WildAGTEval***, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with **irrelevant information** complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally *distort* user intent merely to claim task completion, critically affecting user satisfaction.
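The abstract describes roughly 32K test configurations composed from 60 complexity scenarios spanning two dimensions (API specification and API execution). A minimal sketch of how such composition could work is shown below; all scenario names, axes, and counts here are illustrative assumptions, not taken from the paper, and a configuration is simply one choice per complexity axis combined via a Cartesian product.

```python
from itertools import product

# Hypothetical complexity axes (illustrative only; not the paper's actual
# scenario taxonomy). Each axis lists mutually exclusive scenario options.
SPEC_SCENARIOS = {            # dimension 1: API specification complexity
    "documentation": ["concise", "verbose", "with_constraints"],
    "irrelevant_info": ["none", "distractor_apis"],
}
EXEC_SCENARIOS = {            # dimension 2: API execution complexity
    "output_noise": ["clean", "noisy"],
    "runtime_errors": ["none", "timeout", "rate_limit"],
}

def compose_configs(*dimensions):
    """Compose test configurations: one option per axis, across all axes."""
    axes = {name: options for dim in dimensions for name, options in dim.items()}
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*axes.values())]

configs = compose_configs(SPEC_SCENARIOS, EXEC_SCENARIOS)
print(len(configs))  # 3 * 2 * 2 * 3 = 36 configurations from 10 scenario options
```

Under this composition scheme, a modest number of per-axis scenario options multiplies into a much larger configuration space, which is consistent with the abstract's 60-scenarios-to-~32K-configurations ratio.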
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Resources and Evaluation, Interpretability and Analysis of Models for NLP, Dialogue and Interactive Systems
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 135