Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistants

ACL ARR 2025 February Submission1392 Authors

13 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Recently, multi-agent frameworks based on large language models (LLMs) have developed rapidly. However, datasets for evaluating these multi-agent frameworks remain insufficiently developed. We present Auto-SLURP, a dataset designed to evaluate LLM-based multi-agent frameworks and to assess whether they can support smart personal assistants. The dataset is derived from the SLURP dataset, which was originally created to train and test models' natural language understanding capabilities. By relabeling the data and incorporating simulated servers and external services, we evaluate the entire end-to-end pipeline for smart personal assistants, from language understanding through operation execution to response generation. This benchmark proves sufficiently challenging to test state-of-the-art multi-agent frameworks. Experimental results show that we are still a few steps away from achieving a reliable and smart personal assistant through multi-agent frameworks.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: large language model, multi-agent framework, smart personal assistant, dataset benchmark
Contribution Types: Data resources
Languages Studied: English
Submission Number: 1392