Self-Correction Bench: Revealing the Self-Correction Blind Spot in LLMs

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: self-correction, llm
Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM. While LLMs can identify errors in user input, they exhibit a systematic 'Self-Correction Blind Spot': they fail to correct identical errors in their own outputs. To study this phenomenon, we introduce Self-Correction Bench, a framework that measures it through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average blind spot rate of 64.5%. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation of current LLMs and offers potential avenues for improving their reliability and trustworthiness.
Submission Number: 139
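The "Wait" intervention described in the abstract can be probed with a short script. The sketch below is illustrative only and is not the authors' released benchmark: the model name, plain-text prompt format, and injected arithmetic error are assumptions, and a faithful reproduction would follow the paper's error-injection protocol and each model's chat template.

```python
# Minimal sketch (assumptions noted above): inject an error into an assistant turn,
# then compare the model's continuation with and without an appended "Wait".
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small instruct model, for illustration only

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Plain-text conversation with a deliberately injected arithmetic error
# in the assistant turn (correct answer is 43).
prompt = (
    "User: What is 17 + 26?\n"
    "Assistant: 17 + 26 = 44."
)

def continue_generation(suffix: str) -> str:
    """Append `suffix` to the erroneous assistant turn and let the model continue."""
    inputs = tokenizer(prompt + suffix, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print("Without 'Wait':", continue_generation("\n"))
print("With 'Wait':   ", continue_generation(" Wait"))
```

Under this setup, a blind spot would show up as the model continuing past the injected error when nothing is appended, while the " Wait" suffix prompts it to revisit and correct the mistaken sum.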