Keywords: tool-augmented LLMs, reasoning, tool-use, ReAct, benchmark, agents
TL;DR: ToolComp is a benchmark designed to evaluate complex, multi-step tool-use reasoning tasks through human-edited/verified prompts, final answers, and process supervision labels.
Abstract: Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures at inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation of 20 models spanning six model families demonstrates the challenging nature of our dataset, with frontier models achieving an average accuracy of only 55%.
Supplementary Material: zip
Submission Number: 39