RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

ICLR 2026 Conference Submission 22402 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Faithfulness, Large Reasoning Models, Benchmark
TL;DR: RFEval introduces a benchmark and evaluation framework that probes Large Reasoning Models with counterfactual reasoning interventions to measure reasoning faithfulness—via stance consistency and causal influence—separately from final-answer accuracy.
Abstract: Large Reasoning Models (LRMs) exhibit strong performance, yet they often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for *reasoning faithfulness*, defined by two testable conditions: *stance consistency* (a coherent stance links the reasoning to the answer) and *causal influence* (the stated reasoning causally drives the answer under output-level interventions), both explicitly decoupled from accuracy. To operationalize this framework, we present **RFEval**, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7\% of outputs, predominantly due to post-intervention stance inconsistency. Failures concentrate in brittle, convergent domains such as math and code, and they correlate more strongly with training paradigm than with scale: hybrid pipelines that combine diverse supervised fine-tuning with reinforcement learning yield more faithful models, while size alone is not predictive. Crucially, *accuracy is neither necessary nor sufficient for faithfulness*: after controlling for model and task, the accuracy–faithfulness link is weak and not statistically significant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process.
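For intuition, here is a minimal Python sketch of how the two conditions might be checked with an output-level counterfactual intervention. All names (`Model`, `check_faithfulness`, `stance_of`, `flip`) and the single-intervention design are hypothetical illustrations of the framing described in the abstract, not RFEval's actual protocol or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical interface: a "model" is any callable mapping
# (question, forced_reasoning_prefix or None) -> (reasoning, answer).
Model = Callable[[str, Optional[str]], Tuple[str, str]]


@dataclass
class FaithfulnessResult:
    stance_consistent: bool    # reasoning and answer commit to the same stance
    causally_influenced: bool  # intervening on the reasoning changes the answer
    faithful: bool             # both conditions hold


def check_faithfulness(
    model: Model,
    question: str,
    stance_of: Callable[[str], str],  # hypothetical: maps text to a stance label
    flip: Callable[[str], str],       # hypothetical: builds a counterfactual chain of thought
) -> FaithfulnessResult:
    # 1) Elicit the model's own reasoning and final answer.
    reasoning, answer = model(question, None)

    # 2) Stance consistency: the stated reasoning should support the answer given.
    stance_consistent = stance_of(reasoning) == stance_of(answer)

    # 3) Causal influence: perform an output-level intervention by forcing a
    #    counterfactual reasoning prefix, then let the model re-derive the answer.
    _, answer_cf = model(question, flip(reasoning))

    # If the reasoning causally drives the answer, the intervened answer should
    # follow the counterfactual stance rather than the original one.
    causally_influenced = stance_of(answer_cf) != stance_of(answer)

    return FaithfulnessResult(
        stance_consistent=stance_consistent,
        causally_influenced=causally_influenced,
        faithful=stance_consistent and causally_influenced,
    )
```

Under this framing, an output counts as unfaithful if either condition fails, and faithfulness is scored independently of whether the original answer was correct.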
Primary Area: datasets and benchmarks
Submission Number: 22402