Abstract: Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet how to evaluate and analyze the tool-utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose tool utilization into multiple sub-processes, including instruction following, planning, reasoning, retrieval, understanding, and review. Based on this decomposition, we further introduce T-Eval to evaluate the tool-utilization capability step by step. T-Eval disentangles tool-utilization evaluation into several sub-domains along model capabilities, facilitating a fine-grained understanding of both the holistic and isolated competencies of LLMs. We conduct extensive experiments on T-Eval and an in-depth analysis of various LLMs. T-Eval not only exhibits consistency with outcome-oriented evaluation but also provides a more fine-grained analysis of LLM capabilities, offering a new perspective on evaluating the tool-utilization ability of LLMs. The benchmark will be available at https://github.com/open-compass/T-Eval.