Can large language models independently complete tasks? A dynamic evaluation framework for multi-turn task planning and completion

Published: 01 Jan 2025, Last Modified: 15 May 2025 · Neurocomputing 2025 · CC BY-SA 4.0
Abstract: Large language models (LLMs) are increasingly relied upon in multi-turn dialogue to carry out complex tasks. However, existing benchmarks mainly evaluate LLMs as agents, overlooking their potential as independent systems for accomplishing complex tasks. In addition, these benchmarks typically evaluate the models' planning and completion capabilities separately rather than simultaneously. To address these issues, we propose a new Dynamic Evaluation Framework for Multi-Turn task planning and completion (DEF-MT) to assess the ability of LLMs to independently complete complex tasks in multi-turn scenarios. Our approach quantifies a model's planning capability by guiding it to generate plans and responses sequentially. At the same time, we use a dynamic approach to generate data that simulates the complex intents of real users. Finally, experiments conducted on 9 mainstream models using the MultiWOZ 2.2 dataset indicate that existing models' sub-task planning capabilities hinder their ability to complete complex tasks, providing a meaningful reference for future directions in optimizing LLMs.
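The abstract describes guiding a model to produce a sub-task plan before each response and scoring both within a multi-turn loop. The sketch below only illustrates that general idea; the `chat_model`, `score_plan`, and `score_response` callables and the prompt wording are assumptions for illustration, not the paper's actual DEF-MT implementation or metrics.

```python
# Minimal sketch of a plan-then-respond multi-turn evaluation loop.
# Assumptions: `chat_model(messages) -> str` wraps any chat LLM API;
# `score_plan` / `score_response` are placeholder metrics, not DEF-MT's.
from typing import Callable, Dict, List

Message = Dict[str, str]


def evaluate_dialogue(chat_model: Callable[[List[Message]], str],
                      user_turns: List[str],
                      score_plan: Callable[[str], float],
                      score_response: Callable[[str], float]) -> Dict[str, float]:
    history: List[Message] = [{
        "role": "system",
        "content": ("Before answering, first list the sub-task plan "
                    "for the current turn, then give the response."),
    }]
    plan_scores: List[float] = []
    response_scores: List[float] = []

    for user_utterance in user_turns:
        history.append({"role": "user", "content": user_utterance})

        # First call: elicit only the sub-task plan for this turn.
        history.append({"role": "user",
                        "content": "Step 1: output only the sub-task plan."})
        plan = chat_model(history)
        history.append({"role": "assistant", "content": plan})

        # Second call: elicit the user-facing response, conditioned on the plan.
        history.append({"role": "user",
                        "content": "Step 2: output only the response to the user."})
        response = chat_model(history)
        history.append({"role": "assistant", "content": response})

        # Score planning and completion for the same turn.
        plan_scores.append(score_plan(plan))
        response_scores.append(score_response(response))

    return {
        "plan": sum(plan_scores) / len(plan_scores),
        "response": sum(response_scores) / len(response_scores),
    }
```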