TaskBench: Benchmarking Large Language Models for Task Automation

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: LLM, Task Automation, Autonomous Agents
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Recently, the remarkable progress of large language models (LLMs) has sparked interest in task automation, in which a complex task described by a user instruction is decomposed into sub-tasks and external tools are invoked to execute them; this capability plays a central role in autonomous agents. Consequently, there is an urgent need for a systematic and standardized benchmark to foster the development of LLMs for task automation. To this end, we introduce TaskBench to evaluate task automation. Specifically, task automation can be formulated as three critical stages (i.e., task decomposition, tool invocation, and parameter prediction) required to fulfill user intent, which makes data collection more challenging than for common NLP tasks. To address this, we introduce the concept of a Tool Graph to represent the decomposed tasks underlying a user intent, and adopt a back-instruct method to generate user instructions. Moreover, the mechanism of task automation also calls for more advanced metrics to measure the capability of LLMs. We therefore propose TaskEval to evaluate LLMs on our curated datasets from different aspects, including task decomposition, tool invocation, and parameter prediction. Experimental results demonstrate that TaskBench effectively reflects the capability of LLMs in task automation. The code and datasets of TaskBench are available in the supplementary material.
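To make the three evaluated stages concrete, below is a minimal, illustrative Python sketch of how a tool graph, a decomposed task, and per-stage scores might be represented. The tool names, data fields, and the F1-style scoring helper are assumptions for illustration only, not the actual TaskBench/TaskEval implementation; the sketch merely shows the kind of structure the benchmark evaluates (tool nodes for task decomposition, edges for tool invocation, and predicted parameters).

```python
# Illustrative sketch only -- tool names, fields, and scoring are assumptions,
# not the actual TaskBench/TaskEval implementation.

# A toy "tool graph": nodes are tools, edges indicate that one tool's output
# can feed another tool's input.
TOOL_GRAPH = {
    "image_captioning": ["text_translation", "text_to_speech"],
    "text_translation": ["text_to_speech"],
    "text_to_speech": [],
}

# A decomposed task predicted by an LLM for the (hypothetical) instruction
# "Describe this photo in French and read it aloud."
prediction = {
    "nodes": ["image_captioning", "text_translation", "text_to_speech"],
    "edges": [("image_captioning", "text_translation"),
              ("text_translation", "text_to_speech")],
    "parameters": {"text_translation": {"target_lang": "fr"}},
}

ground_truth = {
    "nodes": ["image_captioning", "text_translation", "text_to_speech"],
    "edges": [("image_captioning", "text_translation"),
              ("text_translation", "text_to_speech")],
    "parameters": {"text_translation": {"target_lang": "fr"}},
}


def f1(pred, gold):
    """Set-level F1 between predicted and gold items (hypothetical metric)."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


# Score the three stages separately: tool selection (nodes), invocation
# structure (edges), and parameter prediction (flattened (tool, key, value) triples).
node_f1 = f1(prediction["nodes"], ground_truth["nodes"])
edge_f1 = f1(prediction["edges"], ground_truth["edges"])
flatten = lambda p: {(t, k, v) for t, kv in p["parameters"].items() for k, v in kv.items()}
param_f1 = f1(flatten(prediction), flatten(ground_truth))

print(node_f1, edge_f1, param_f1)  # 1.0 1.0 1.0 for this toy example
```

In this toy setup, separating node, edge, and parameter scores mirrors the abstract's three stages, so a model can be credited for choosing the right tools even if it mis-orders them or mispredicts arguments.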
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6761