Abstract: Large Reasoning Models (LRMs) have achieved notable progress in both information retrieval and complex reasoning tasks. However, LRMs often suffer from reasoning hallucinations, which compromise the accuracy and interpretability of their outputs. Although the "LLM-as-a-Judge" paradigm shows promise for evaluating factual hallucinations, systematic exploration and standardized benchmarks for identifying hallucinations in the thinking process are still lacking.
To address this, we construct a new benchmark dataset, \textbf{ThinkHalu}, covering cognitive hallucinations (miscomprehension of the question) and logical hallucinations (logical errors in reasoning steps). Building on GSM8K and MathQA as base datasets, we generate hallucinated thinking processes with eight open-source LRMs. We select high-quality data by analyzing the semantic similarity and natural language inference (NLI) relationships between hallucinated thinking processes and correct solution processes. We then identify and analyze hallucination subtypes using an LLM-based voting mechanism.
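A minimal sketch of how such a similarity-plus-NLI filter could look is given below. The model names (all-MiniLM-L6-v2, roberta-large-mnli), the 0.6 threshold, and the `keep_sample` helper are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative filtering sketch: keep hallucinated thinking processes that are
# semantically close to the reference solution yet not entailed by it.
# Model names and thresholds are placeholders, not the paper's actual choices.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")
nli = pipeline("text-classification", model="roberta-large-mnli")

def keep_sample(hallucinated: str, reference: str, sim_threshold: float = 0.6) -> bool:
    # Semantic similarity between the hallucinated thinking and the correct solution.
    emb = embedder.encode([hallucinated, reference], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # NLI relation with the correct solution as premise.
    relation = nli({"text": reference, "text_pair": hallucinated})[0]["label"]

    # Keep plausible-looking hallucinations: similar to the reference
    # but not entailed by it (i.e., they contain a genuine error).
    return similarity >= sim_threshold and relation != "ENTAILMENT"
```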
We conduct experiments on 20 open-source LLMs and observe that they tend to misclassify cognitive hallucinations as logical hallucinations, indicating limitations in their ability to detect hallucinations in thinking processes.
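For context, the reported misclassification pattern can be quantified with a simple confusion count over judge outputs. The `judge_label` function below is a hypothetical stand-in for prompting an evaluated LLM; it is a sketch, not the paper's evaluation code.

```python
# Sketch of tallying the misclassification trend; `judge_label` is a
# hypothetical placeholder for querying the LLM under evaluation.
from collections import Counter

def judge_label(question: str, thinking: str) -> str:
    """Ask the evaluated LLM to label the thinking process as
    'cognitive' or 'logical' hallucination. Placeholder implementation."""
    raise NotImplementedError

def confusion_counts(samples):
    # samples: iterable of (question, thinking, gold_label) triples
    counts = Counter()
    for question, thinking, gold in samples:
        pred = judge_label(question, thinking)
        counts[(gold, pred)] += 1
    return counts

# A large counts[("cognitive", "logical")] entry would reflect the observed
# tendency to misclassify cognitive hallucinations as logical ones.
```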
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, hallucination detection
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3549