MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

Published: 25 Sept 2024, Last Modified: 06 Nov 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY-NC-ND 4.0
Keywords: Large Language Models;Reasoning;System-2;Slow Thinking;Resource and Evaluation;Analysis and Interpretability
TL;DR: MR-Ben is a comprehensive process-oriented evaluation benchmark comprising 5,975 questions that cover various subjects, ranging from natural science to coding, challenging LLMs to score the reasoning steps of candidate solutions.
Abstract: Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate, becoming less effective in tracking meaningful progress. To address this, we present a process-based benchmark MR-Ben that demands a meta-reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited for system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, with models like the o1 series from OpenAI demonstrating strong performance by effectively scrutinizing the solution space, many other state-of-the-art models fall significantly behind on MR-Ben, exposing potential shortcomings in their training strategies and inference methodologies.
Supplementary Material: zip
Primary Area: Evaluation (methodology, meta studies, replicability and validity)
Submission Number: 8800
Loading