Keywords: tool-augmented LLMs, reasoning, tool-use, ReAct, benchmark, agents
TL;DR: ToolComp is a benchmark designed to evaluate complex, multi-step tool-use reasoning tasks through human-edited/verified prompts, final answers, and process supervision labels.
Abstract: Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures at inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation of 20 models spanning six model families demonstrates the challenging nature of our dataset, with frontier models achieving an average accuracy of only 55%.
Supplementary Material: zip
Submission Number: 39