Keywords: ai, llms, reasoning, chain of thought, faithfulness, fine-tuning, dpo
TL;DR: We introduce a pipeline to train models for increased Chain-of-Thought faithfulness without human intervention.
Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks, but recent work shows that generated rationales often contain unfaithful steps that are disconnected from the final answer. Prior approaches focus primarily on \emph{measuring} faithfulness, while methods for \emph{improving} it are lacking. We introduce \textbf{Faithful Reasoning via Intervention Training (FRIT)}, a scalable alignment method that enforces causal consistency between reasoning steps and outcomes. FRIT constructs synthetic counterfactual training data by systematically intervening on individual steps in generated CoTs, producing faithful/unfaithful pairs without human supervision. Using this data, we apply Direct Preference Optimization (DPO) to improve reasoning reliability in Qwen3-8B and Mistral-7B-v0.1 across both factual and symbolic domains. Experiments show that FRIT substantially reduces unfaithful reasoning on both models while also increasing accuracy. Our results highlight FRIT as the first scalable intervention-based framework for training language models to produce more trustworthy and interpretable rationales.
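As a concrete illustration of the counterfactual pairing idea described in the abstract, the sketch below shows one way faithful/unfaithful preference pairs could be assembled from step-level interventions. The callables `generate_cot`, `continue_from_steps`, and `perturb_step` are hypothetical placeholders for model calls, and the pairing rule is an illustrative assumption; this is a minimal sketch, not the authors' released pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# --- Hypothetical interfaces (placeholders standing in for model calls) ---
GenerateCoT = Callable[[str], Tuple[List[str], str]]   # prompt -> (reasoning steps, final answer)
ContinueFromSteps = Callable[[str, List[str]], str]    # (prompt, steps) -> answer derived from those steps
PerturbStep = Callable[[str], str]                     # step -> altered / contradictory step


@dataclass
class PreferencePair:
    """One DPO example: same prompt and intervened reasoning, two candidate completions."""
    prompt: str
    chosen: str    # completion consistent with the intervened reasoning (faithful)
    rejected: str  # original answer kept despite the intervention (unfaithful)


def build_counterfactual_pairs(
    prompts: List[str],
    generate_cot: GenerateCoT,
    continue_from_steps: ContinueFromSteps,
    perturb_step: PerturbStep,
) -> List[PreferencePair]:
    """Assemble preference pairs by intervening on individual CoT steps (illustrative sketch)."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        steps, original_answer = generate_cot(prompt)
        for i, step in enumerate(steps):
            # Intervene on a single reasoning step.
            corrupted_steps = steps[:i] + [perturb_step(step)] + steps[i + 1:]
            counterfactual_answer = continue_from_steps(prompt, corrupted_steps)
            if counterfactual_answer == original_answer:
                # The answer ignored the intervention; no informative contrast here.
                continue
            context = prompt + "\n" + "\n".join(corrupted_steps)
            pairs.append(
                PreferencePair(
                    prompt=context,
                    chosen=counterfactual_answer,   # follows from the intervened steps
                    rejected=original_answer,       # disconnected from the intervened steps
                )
            )
    return pairs
```

Under these assumptions, the resulting pairs could then be fed to a standard DPO trainer (for example, TRL's `DPOTrainer`) to push the model toward answers that remain causally consistent with their stated reasoning.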
Submission Number: 97