Keywords: asynchronous tool call; benchmark
Abstract: Agents based on large language models (LLMs) have demonstrated strong proficiency in leveraging external tools to solve complex problems. However, existing evaluations largely overlook the temporal dimension of tool invocation, particularly the practical impact of inherent tool response latency, and they are typically confined to single-task scenarios. In realistic applications, tasks often need to be executed in parallel, and overall efficiency critically depends on the ability to utilize idle time during tool response delays. We denote this capability asynchronous tool calling. To address the lack of evaluation in this area, we propose ASYNCTOOL, which is, to the best of our knowledge, the first benchmark specifically designed to assess the asynchronous multitasking abilities of LLM-based agents in interactive tool-use settings. ASYNCTOOL consists of composite tasks with intra-task step dependencies that must be executed concurrently under realistic tool response delays. Through a hybrid data evolution strategy, we construct a diverse and representative asynchronous multitasking dataset that covers multiple scenarios and exhibits a wide range of tool-use patterns. We further assess performance at three levels, namely the Step Level, Sub-Task Level, and Task Level, spanning fine-grained to coarse-grained perspectives. Extensive experiments on ASYNCTOOL show that even state-of-the-art models suffer notable performance degradation when confronted with complex asynchronous workflows. Our analysis identifies the main failure modes of current tool-using agents and provides practical guidelines for designing future systems with stronger temporal reasoning and coordination capabilities.
Primary Area: datasets and benchmarks
Submission Number: 2352