Abstract: Large language models (LLMs) have achieved remarkable progress in autonomous reasoning, evolving from basic text processing to sophisticated multimodal reasoning, a critical capability for general-purpose AI assistants. However, existing benchmarks usually fail to capture the intricate multi-step reasoning demands inherent in real-world scenarios. To bridge this gap, we propose **C²RBench**: a **C**hinese **C**omplex **R**easoning **Bench**mark for evaluating the advanced multi-step, multimodal reasoning capabilities of LLMs. C²RBench comprises 1,115 carefully curated Chinese tasks organized into eight domain-specific subsets, each meticulously designed to mirror real-world challenges. The benchmark is hierarchical, with three difficulty tiers based on the number of reasoning steps required (8.44 steps per task on average), significantly exceeding existing benchmarks in cognitive complexity. Extensive evaluations of 16 LLMs (including DeepSeek-R1) and 20 multimodal large language models (MLLMs) on C²RBench reveal critical performance gaps: GPT-4o achieves only 45.20% accuracy, indicating substantial room for improvement.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 2195