Keywords: multi-turn, reasoning, evaluation
TL;DR: We introduce an evolving arena to evaluate LLMs' multi-turn reasoning capabilities, featuring an automated framework and adjustable difficulty levels.
Abstract: Recent advances in LLMs have shown promising results on complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present EvolArena, an Evolving Arena for evaluating LLMs' multi-turn reasoning. Comprising 4 classes, 40 tasks, and 3600 instances, EvolArena covers diverse reasoning capabilities, offers fine-grained difficulty levels, and necessitates multi-turn interaction with the environments. Moreover, EvolArena features a fully automated framework spanning both dataset construction and model evaluation, enabling scalable assessment without human intervention. Experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research on interactive AI systems.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 7478