Abstract: LLM-based agents can greatly extend the capabilities of LLMs and have therefore attracted a sharp increase in research attention.
An ambitious vision -- serving users by manipulating massive collections of API-based tools -- has been proposed and explored. However, we find that a widely accepted evaluation mechanism for generic agents is still missing.
This work aims to fill this gap.
We decompose tool-use capability into seven aspects and form a thorough evaluation schema.
In addition, we design and release an instruction dataset and a toolset -- the two sides that agents bridge -- following the principle of reflecting real-world challenges.
Furthermore, we evaluate multiple generic agents. Our findings can inspire future research on improving LLM-based agents and rethinking the philosophy of API design.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, evaluation methodologies, evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 5041