Keywords: multi-turn, reasoning, evaluation
TL;DR: We introduce an evolving arena to evaluate LLMs' multi-turn reasoning capabilities, featuring an automated framework and adjustable difficulty levels.
Abstract: Recent advances in LLMs have shown promising results on complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present EvolArena, an Evolving Arena for evaluating LLMs' multi-turn reasoning. Comprising 4 classes, 40 tasks, and 3600 instances, EvolArena covers diverse reasoning capabilities, offers fine-grained difficulty levels, and necessitates multi-turn interaction with the environments. Moreover, EvolArena features a fully automated framework spanning both dataset construction and model evaluation, enabling scalable assessment without human intervention. Experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research on interactive AI systems.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 7478