RefineBench: Evaluating Refinement Capability in Language Models

Published: 06 Oct 2025, Last Modified: 04 Nov 2025 · MTI-LLM @ NeurIPS 2025 Oral · CC BY-ND 4.0
Keywords: Refinement, Large Language Model, Checklist
TL;DR: We propose RefineBench, a benchmark of 1,002 challenging problems across 11 domains with a controlled checklist-based evaluation framework. It supports two refinement settings: self-refinement and guided refinement.
Abstract: Can language models (LMs) self-refine their own responses? This question is increasingly relevant as more than 10% of real-world user interactions involve refinement requests (see Appendix F). Yet prior studies have largely tested LMs on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback about what went wrong. The recent advent of reasoning models that exhibit self-reflection patterns in their chain-of-thought further motivates this question. To address it, we introduce RefineBench, a benchmark of 1,002 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-4.1 achieve modest baseline scores of 31.1 and 23.4, respectively, and most models fail to improve consistently across iterations (e.g., Gemini 2.5 Pro gains only +1.8%, while DeepSeek-R1 declines by 0.2%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine effectively when their initial responses are incorrect, and that RefineBench provides a valuable testbed for tracking progress.
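The two refinement modes can be sketched as a single loop that differs only in whether checklist-derived feedback is passed back to the model. This is a minimal illustration, not the paper's implementation: `generate`, `grade`, and `feedback_fn` are hypothetical stand-ins for an LM call, the checklist-based grader, and the natural-language feedback writer.

```python
from typing import Callable, List, Optional

def refine(problem: str,
           checklist: List[str],
           generate: Callable[[str], str],
           grade: Callable[[str, List[str]], List[bool]],
           feedback_fn: Optional[Callable[[List[str], List[bool]], str]] = None,
           max_turns: int = 5) -> float:
    """Run up to `max_turns` refinement rounds; return the final checklist score.

    feedback_fn is None  -> self-refinement (no guidance)
    feedback_fn is given -> guided refinement (targeted natural-language feedback)
    """
    prompt = problem
    score = 0.0
    for _ in range(max_turns):
        answer = generate(prompt)
        results = grade(answer, checklist)      # one pass/fail per checklist item
        score = sum(results) / len(results)
        if score == 1.0:                        # all checklist items satisfied
            break
        if feedback_fn is not None:             # guided: point out failed items
            hint = feedback_fn(checklist, results)
        else:                                   # self-refinement: generic nudge only
            hint = "Please review your previous answer and improve it."
        prompt = f"{problem}\n\nPrevious answer:\n{answer}\n\n{hint}"
    return score
```

Under this framing, the paper's result is that the loop converges to near-perfect scores within five turns when `feedback_fn` supplies targeted feedback, but barely moves the score when the model must diagnose its own errors.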
Supplementary Material: zip
Submission Number: 84