APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets

Zuxin Liu; Thai Quoc Hoang; Jianguo Zhang; Ming Zhu; Tian Lan; Shirley Kokane; Juntao Tan; Weiran Yao; Zhiwei Liu; Yihao Feng; Rithesh R N; Liangwei Yang; Silvio Savarese; Juan Carlos Niebles; Huan Wang; Shelby Heinecke; Caiming Xiong

Published: 26 Sept 2024, Last Modified: 13 Nov 2024

Venue: NeurIPS 2024 Track Datasets and Benchmarks Poster

License: CC BY 4.0

Keywords: LLM Agent, function-calling, data synthesis, data generation

TL;DR: APIGen is an automated pipeline that synthesizes verifiable, diverse, and high-quality datasets to enhance the function-calling capability of LLMs.

Abstract: The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize high-quality datasets for function-calling applications. Using APIGen, we collect 3,673 executable APIs across 21 different categories and generate diverse function-calling datasets in a scalable and structured manner. Each data point in our dataset is verified through three hierarchical stages: format checking, actual function execution, and semantic verification, which improves its reliability and correctness. We demonstrate that models trained on our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agents. The dataset and models are available on the project homepage: https://apigen-pipeline.github.io/
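The three hierarchical verification stages named in the abstract can be illustrated with a minimal sketch. This is not the APIGen implementation: the sample schema (`query`, `call`, `name`, `arguments`), the `API_REGISTRY` table, and the `semantic_checker` callback are all hypothetical names chosen for illustration, and the semantic stage is left as a pluggable function because the abstract does not describe how it is performed.

```python
import json
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry mapping API names to executable callables.
# A toy stand-in for a collection of executable APIs.
API_REGISTRY: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
}


def check_format(raw: str) -> Tuple[bool, Any]:
    """Stage 1: the sample must parse as JSON and have the assumed fields."""
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(sample, dict) or not isinstance(sample.get("query"), str):
        return False, "missing or non-string 'query'"
    call = sample.get("call")
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return False, "missing or malformed 'call'"
    if not isinstance(call.get("arguments"), dict):
        return False, "'arguments' must be an object"
    return True, sample


def check_execution(sample: Dict[str, Any]) -> Tuple[bool, Any]:
    """Stage 2: actually run the call and reject samples that fail."""
    call = sample["call"]
    func = API_REGISTRY.get(call["name"])
    if func is None:
        return False, "unknown API"
    try:
        return True, func(**call["arguments"])
    except Exception as exc:  # any runtime failure rejects the sample
        return False, f"error: {exc}"


def verify(raw: str, semantic_checker: Callable[[str, dict, Any], bool]) -> Tuple[bool, str]:
    """Run the stages in order; a sample is kept only if all three pass."""
    ok, result = check_format(raw)
    if not ok:
        return False, f"format: {result}"
    sample = result
    ok, output = check_execution(sample)
    if not ok:
        return False, f"execution: {output}"
    # Stage 3: assumed callback that judges whether the output answers the query.
    if not semantic_checker(sample["query"], sample["call"], output):
        return False, "semantic: output does not answer the query"
    return True, "passed"
```

Ordering the stages from cheapest to most expensive means malformed or non-executable samples are discarded before the semantic check, which in a real pipeline would likely be the costliest step.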

Supplementary Material: pdf

Submission Number: 2150
