A Survey on Chain-of-Thought Reasoning Evaluation in Large Language Models: What, How, Who, and Where

ACL ARR 2026 January Submission 5718 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Chain-of-Thought Reasoning, Evaluation, Comprehensive Survey
Abstract: As large language models evolve from conversational agents to reasoning engines, Chain-of-Thought (CoT) evaluation has become pivotal, yet it remains fragmented and heavily reliant on final-answer accuracy. This outcome bias often masks \textbf{\textit{disguised accuracy}}, failing to distinguish faithful reasoning from post-hoc rationalization. Despite the urgency, a comprehensive survey systematizing these process-oriented assessment techniques remains absent. To fill this gap, we present a unified framework that organizes the literature along four dimensions: \textbf{\textit{what to evaluate}}, \textbf{\textit{how to evaluate}}, \textbf{\textit{who evaluates}}, and \textbf{\textit{where to evaluate}}. We formalize core quality metrics and systematically analyze evaluation methodologies across qualitative and quantitative paradigms in diverse application domains. By synthesizing these approaches to improve the reliability of CoT reasoning, this survey provides guidance for building robust evaluation pipelines and highlights key frontiers for future research.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling
Contribution Types: Position papers
Languages Studied: English
Submission Number: 5718