Abstract: Developing datasets and benchmarks for API-augmented Large Language Models (LLMs) is essential to supporting their integration with external tools and APIs. This systematic literature review analyzed 69 studies across three dimensions: integration tasks, dataset curation, and dataset characteristics. The results show that most datasets focus on API selection and API calling tasks, rely heavily on synthetic data, and often lack natural, diverse, multi-turn interactions. The review recommends building richer, more realistic datasets that emphasize linguistic diversity, naturalness, and context awareness; adopting hybrid curation that combines human-authored and synthetic data; and improving evaluation methods for benchmarking API-augmented LLMs.