ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

ACL ARR 2025 February Submission7085 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly evaluate the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and provides an OOD dataset of 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that the performance of most LLMs is far from robust and that they exhibit a certain degree of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination. Examples of our dataset are available at https://anonymous.4open.science/r/ThinkBench-Review/.
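The abstract's central idea is dynamic OOD data generation as a guard against contamination. The paper's actual generation procedure is not described on this page, so the sketch below is only an illustrative, hypothetical example of the general technique: instantiating a reasoning template with freshly sampled values and recomputing the gold answer, so evaluated items never coincide with memorized training text. All names (ReasoningTemplate, make_ood_variant) are assumptions, not part of ThinkBench.

```python
# Minimal sketch of dynamic OOD item generation (hypothetical; not the ThinkBench method).
# Idea: keep the reasoning structure fixed, resample surface values, recompute the answer.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReasoningTemplate:
    question: str                      # template with {a} and {b} placeholders
    answer_fn: Callable[[int, int], int]  # recomputes the gold answer from sampled values

def make_ood_variant(template: ReasoningTemplate, rng: random.Random) -> dict:
    """Sample fresh values, instantiate the question, and recompute its answer."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return {
        "question": template.question.format(a=a, b=b),
        "answer": template.answer_fn(a, b),
    }

if __name__ == "__main__":
    tpl = ReasoningTemplate(
        question="A shop sells {a} apples per day. How many apples does it sell in {b} days?",
        answer_fn=lambda a, b: a * b,
    )
    rng = random.Random(0)
    print(make_ood_variant(tpl, rng))  # a new, unseen instance on each run
```

Because every evaluated item is generated at test time, a model cannot rely on having seen the exact question-answer pair during training, which is the contamination-reduction property the abstract claims.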
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; automatic creation and evaluation of language resources; automatic evaluation of datasets; evaluation methodologies
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7085