Abstract: Recently, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which decomposes complex tasks described by user instructions into sub-tasks and invokes external tools to execute them, and which plays a central role in autonomous agents. However, a systematic and standardized benchmark to foster the development of LLMs in task automation is still lacking. To this end, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be formulated into three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation more challenging than for common NLP tasks. To generate high-quality evaluation datasets, we introduce the Tool Graph to represent the decomposed tasks in user intent, and adopt Back-Instruct to simulate user instructions and annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. Our experimental findings reveal that TaskBench effectively measures the task automation proficiency of LLMs. Benefiting from a mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation, establishing it as a reliable and comprehensive benchmark for LLM-based agents.
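To make the three evaluation stages concrete, here is a minimal sketch (not the authors' code) of how a predicted task plan might be structured for stage-wise scoring: a natural-language decomposition, a set of tool invocations forming a tool graph, and predicted parameters. All tool names, fields, and the example instruction are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str                                        # invoked tool (a node in the tool graph)
    parameters: dict = field(default_factory=dict)   # predicted arguments for the tool
    depends_on: list = field(default_factory=list)   # prerequisite tools (edges in the graph)

@dataclass
class TaskPlan:
    instruction: str       # the user instruction
    subtasks: list         # natural-language task decomposition
    tool_calls: list       # ToolCall objects whose dependencies form a tool graph

# Hypothetical prediction for an illustrative instruction.
plan = TaskPlan(
    instruction="Summarize the attached report and email the summary to Alice",
    subtasks=[
        "extract text from the report",
        "summarize the extracted text",
        "send the summary by email",
    ],
    tool_calls=[
        ToolCall(tool="pdf_to_text", parameters={"file": "report.pdf"}),
        ToolCall(tool="summarize", parameters={"length": "short"}, depends_on=["pdf_to_text"]),
        ToolCall(tool="send_email", parameters={"to": "alice@example.com"}, depends_on=["summarize"]),
    ],
)

# An evaluator in the spirit of TaskEval could compare such a prediction against
# gold annotations at each stage, e.g. matching subtasks, tool nodes/edges, and parameters.
print([call.tool for call in plan.tool_calls])
```

This structure is only meant to show why evaluation is harder than for typical NLP tasks: correctness must be judged on the decomposition, the invoked tools and their dependencies, and the predicted parameters, rather than on a single text output.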
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English