Abstract: Large language models (LLMs) have achieved remarkable progress in autonomous reasoning, evolving from basic text processing to sophisticated multimodal reasoning, a critical capability for general-purpose AI assistants. However, existing benchmarks usually fail to capture the intricate multi-step reasoning demands inherent in real-world scenarios. To bridge this gap, we propose **C²RBench**: a **C**hinese **C**omplex **R**easoning **Bench**mark for evaluating the advanced multi-step, multimodal reasoning capabilities of LLMs. C²RBench comprises 1,115 carefully curated Chinese tasks organized into eight domain-specific subsets, each meticulously designed to mirror real-world challenges. The benchmark is hierarchical, with three difficulty tiers based on the number of reasoning steps required (8.44 steps per task on average), significantly exceeding existing benchmarks in cognitive complexity. Extensive evaluations of 16 LLMs (including DeepSeek-R1) and 20 multimodal large language models (MLLMs) on C²RBench reveal critical performance gaps: GPT-4o achieves only 45.20% accuracy, indicating substantial room for improvement.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 2195