Abstract: Developing datasets and benchmarks for API-augmented Large Language Models (LLMs) is essential to supporting their integration with external tools and APIs. This systematic literature review analyzed 69 studies across three dimensions: integration tasks, dataset curation, and dataset characteristics. The results show that most datasets focus on API selection and API calling tasks, rely heavily on synthetic data, and often lack natural, diverse, multi-turn interactions. The review recommends building richer, more realistic datasets that emphasize linguistic diversity, naturalness, and context awareness; adopting hybrid curation that combines human-authored and synthetic data; and improving evaluation methods for benchmarking API-augmented LLMs.