Keywords: LLM Agents, Multi-stage Reasoning, Chain Collaboration, Benchmark
TL;DR: We introduce MSCoRe, a novel benchmark designed to evaluate the multi-stage collaborative reasoning capabilities of LLM agents across complex scenarios.
Abstract: Large Language Model (LLM) agents have excelled in single-stage tasks, but their reasoning and coordination capabilities in multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models' abilities to collaborate and optimize across stages without explicit external guidance. To bridge this gap, we propose MSCoRe, a novel benchmark comprising 126,696 domain-specific QA instances spanning scenarios in the automotive, pharmaceutical, e-commerce, and energy sectors. We also introduce a structured three-phase pipeline for generating high-quality data: dynamic sampling, iterative question-answer generation, and multi-level quality assessment. For a more refined assessment, we categorize tasks into three difficulty levels based on their stage coverage and complexity. Using MSCoRe, we conduct a comprehensive evaluation of various state-of-the-art LLM agents. Commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models' robustness under three types of noisy data and found that their performance is negatively affected by each type of noise. MSCoRe provides a new resource for evaluating and improving multi-stage collaborative reasoning in LLM agents. Code and data are available at https://huggingface.co/datasets/032564yn/MSCoRe.
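Since the abstract points to the dataset on the Hugging Face Hub, here is a minimal sketch of loading it with the standard `datasets` library. The repo id is taken from the URL above; the split name and field layout are assumptions, so check the dataset card for the actual schema.

```python
# Minimal sketch: load MSCoRe from the Hugging Face Hub.
# Assumptions: a standard "train" split and QA-style fields exist;
# verify against the dataset card before relying on these names.
from datasets import load_dataset

ds = load_dataset("032564yn/MSCoRe")  # repo id from the paper's URL
print(ds)  # inspect the available splits and columns

sample = ds["train"][0]  # "train" split is an assumption
print(sample)
```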
Primary Area: datasets and benchmarks
Submission Number: 23212