Abstract: Large Reasoning Models (LRMs) have achieved notable progress in both information retrieval and complex reasoning tasks. However, LRMs often suffer from reasoning hallucinations, which compromise the accuracy and interpretability of their outputs. Although the "LLM-as-a-Judge" paradigm shows promise for evaluating factual hallucinations, systematic exploration and standardized benchmarks for identifying hallucinations in the thinking process are still lacking.
To address this, we construct a new benchmark dataset, \textbf{ThinkHalu}, covering cognitive hallucinations (miscomprehension of the question) and logical hallucinations (logical errors in reasoning steps). Building on GSM8K and MathQA as base datasets, we generate hallucinated thinking processes with eight open-source LRMs. We select high-quality data by analyzing the semantic similarity and natural language inference (NLI) relationships between hallucinated thinking processes and correct solution processes. We then identify and analyze hallucination subtypes using an LLM-based voting mechanism.
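A minimal sketch of how such a similarity-plus-NLI filter could look is given below. The model names (all-MiniLM-L6-v2, roberta-large-mnli), the 0.6 threshold, and the `keep_sample` helper are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative filtering sketch: keep hallucinated thinking processes that are
# semantically close to the reference solution yet not entailed by it.
# Model names and thresholds are placeholders, not the paper's actual choices.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")
nli = pipeline("text-classification", model="roberta-large-mnli")

def keep_sample(hallucinated: str, reference: str, sim_threshold: float = 0.6) -> bool:
    # Semantic similarity between the hallucinated thinking and the correct solution.
    emb = embedder.encode([hallucinated, reference], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # NLI relation with the correct solution as premise.
    relation = nli({"text": reference, "text_pair": hallucinated})[0]["label"]

    # Keep plausible-looking hallucinations: similar to the reference
    # but not entailed by it (i.e., they contain a genuine error).
    return similarity >= sim_threshold and relation != "ENTAILMENT"
```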
We conduct experiments on 20 open-source LLMs and observe that they tend to misclassify cognitive hallucinations as logical hallucinations, indicating limitations in their ability to detect hallucinations in thinking processes.
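For context, the reported misclassification pattern can be quantified with a simple confusion count over judge outputs. The `judge_label` function below is a hypothetical stand-in for prompting an evaluated LLM; it is a sketch, not the paper's evaluation code.

```python
# Sketch of tallying the misclassification trend; `judge_label` is a
# hypothetical placeholder for querying the LLM under evaluation.
from collections import Counter

def judge_label(question: str, thinking: str) -> str:
    """Ask the evaluated LLM to label the thinking process as
    'cognitive' or 'logical' hallucination. Placeholder implementation."""
    raise NotImplementedError

def confusion_counts(samples):
    # samples: iterable of (question, thinking, gold_label) triples
    counts = Counter()
    for question, thinking, gold in samples:
        pred = judge_label(question, thinking)
        counts[(gold, pred)] += 1
    return counts

# A large counts[("cognitive", "logical")] entry would reflect the observed
# tendency to misclassify cognitive hallucinations as logical ones.
```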
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, hallucination detection
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3549