Keywords: Benchmark, LLM, Math, Evaluation
TL;DR: We transformed the GSM8K benchmark under our novel meta-reasoning paradigm and conducted extensive experiments on a series of LLMs.
Abstract: In this work, we introduce a novel evaluation paradigm for Large Language Models
(LLMs) that compels them to transition from a traditional question-answering role,
akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, termed meta-reasoning because it centers on "reasoning about reasoning," shifts the emphasis
from result-oriented assessments, which often neglect the reasoning process, to a
more comprehensive evaluation that effectively distinguishes between the cognitive
capabilities of different models. Our meta-reasoning process mirrors "system-2"
slow thinking, requiring careful examination of assumptions, conditions, calculations, and logic to identify mistakes. This paradigm enables the transformation of existing saturated, non-differentiating benchmarks, whose contents may have leaked into the pretraining data, into evaluation tools that are both challenging and robust against data
contamination. To demonstrate this, we applied our paradigm to the GSM8K dataset and
developed the MR-GSM8K benchmark. Our extensive analysis includes several
state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies.
Specifically, we found that the OpenAI o1 models, which exhibit characteristics of
"system-2" thinking, outperform the other SOTA models by more than 20 absolute points
on our benchmark, supporting our deficiency hypothesis.
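
To make the transformation concrete, below is a minimal Python sketch of how a question-answering item can be recast as a solution-scoring item in the spirit of this paradigm. The prompt template, function name, and grading fields are illustrative assumptions, not the benchmark's exact format.

# Minimal sketch of the meta-reasoning transformation described in the abstract:
# a GSM8K question plus a candidate solution becomes a grading task for the model.
# Template and field names are illustrative, not the authors' exact format.

GRADER_PROMPT = """You are a math teacher grading a student's solution.

Question:
{question}

Student solution (numbered steps):
{solution}

Tasks:
1. State whether the final answer is correct (yes/no).
2. If incorrect, give the number of the first erroneous step.
3. Briefly explain the reason for the error.
"""

def build_meta_reasoning_instance(question: str, solution_steps: list[str]) -> str:
    """Turn an original question-answering item into a solution-scoring item."""
    numbered = "\n".join(f"Step {i+1}: {s}" for i, s in enumerate(solution_steps))
    return GRADER_PROMPT.format(question=question, solution=numbered)

# Toy GSM8K-style item with a deliberate arithmetic slip in Step 2 (36 - 5 = 31, not 30),
# so a capable grader should flag the solution as incorrect at that step.
prompt = build_meta_reasoning_instance(
    "Tom buys 3 packs of 12 pencils and gives away 5. How many pencils remain?",
    ["3 * 12 = 36 pencils in total.",
     "36 - 5 = 30 pencils remain.",
     "The answer is 30."],
)
print(prompt)

Scoring the graded output (verdict, first-error step, error reason) rather than only the final answer is what makes the evaluation sensitive to the reasoning process itself.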
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6850