Inducing Faithfulness in Structured Reasoning via Counterfactual Sensitivity

04 Sept 2025 (modified: 12 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: LLM reasoning
Abstract: The reasoning processes of large language models often lack faithfulness; a model may generate a correct answer while relying on a flawed or irrelevant reasoning trace. This behavior, a direct consequence of training objectives that reward only final-answer correctness, severely undermines the trustworthiness of these models in high-stakes domains. This paper introduces $\textbf{Counterfactual Sensitivity Regularization (CSR)}$, a novel training objective designed to forge a strong, causal-like dependence between a model's output and its intermediate reasoning steps. During training, CSR performs automated, operator-level interventions on the generated reasoning trace (e.g., swapping "+" with "-") to create a minimally perturbed counterfactual. A regularization term then penalizes the model if this logically flawed trace still yields the original answer. Our efficient implementation adds only $8.7\%$ training overhead through a warm-start curriculum and token-subset optimization. We evaluate faithfulness using $\textbf{Counterfactual Outcome Sensitivity (COS)}$, a metric quantifying how sensitive the final answer is to such logical perturbations. Across diverse structured reasoning benchmarks, spanning arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop QA (HotpotQA), and code generation (MBPP), models trained with CSR demonstrate a vastly superior trade-off between accuracy and faithfulness. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, and this learned sensitivity generalizes to larger models and enhances the performance of inference-time techniques like self-consistency. To demonstrate the broader applicability of this principle, we conduct a pilot study on the HellaSwag commonsense reasoning task, showing that a semantic version of CSR (using causal connectives, temporal markers, and key entities as operators) can significantly improve faithfulness there as well.
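The two mechanisms the abstract names, the operator-level intervention and the COS metric, can be sketched as follows. This is an illustrative reconstruction from the abstract's description only, not the paper's implementation; the names `OPERATOR_SWAPS`, `perturb_trace`, and `counterfactual_outcome_sensitivity` are hypothetical.

```python
import random

# Assumed operator swaps for the "minimally perturbed counterfactual"
# (the abstract's example is swapping "+" with "-").
OPERATOR_SWAPS = {"+": "-", "-": "+", "*": "/", "/": "*"}


def perturb_trace(trace: str, rng: random.Random) -> str:
    """Swap one randomly chosen arithmetic operator in a reasoning trace,
    producing a logically flawed but minimally different counterfactual."""
    positions = [i for i, ch in enumerate(trace) if ch in OPERATOR_SWAPS]
    if not positions:
        return trace  # no operator to intervene on
    i = rng.choice(positions)
    return trace[:i] + OPERATOR_SWAPS[trace[i]] + trace[i + 1:]


def counterfactual_outcome_sensitivity(answers, perturbed_answers):
    """Toy COS: the fraction of examples whose final answer changes when the
    reasoning trace is perturbed. Higher means the answer depends more
    faithfully on the trace."""
    changed = sum(a != p for a, p in zip(answers, perturbed_answers))
    return changed / len(answers)
```

In a CSR-style training loop, a regularizer would then penalize the model whenever the answer conditioned on `perturb_trace(trace)` matches the original answer; the sketch above only illustrates the intervention and the evaluation metric.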
Primary Area: causal reasoning
Submission Number: 1829