ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

Published: 23 Sept 2025 · Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: Chain of Thought/Reasoning models, Understanding high-level properties of models, Reinforcement learning
TL;DR: We introduce ReFIne, a training framework that makes reasoning models more interpretable, faithful, and reliable.
Abstract: Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, which we characterize by three properties: \emph{interpretability}, \emph{faithfulness}, and \emph{reliability}. To this end, we propose \textbf{\texttt{ReFIne}}, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve \emph{interpretability} by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance \emph{faithfulness} by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote \emph{reliability} by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply \textbf{\texttt{ReFIne}} to Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate them on mathematical benchmarks of varying difficulty. Our experimental results show that \textbf{\texttt{ReFIne}} models generate clearer and better-structured reasoning traces (interpretability +44.0\%), more faithfully expose their underlying decision process (faithfulness +18.8\%), and offer informative confidence estimates (reliability +42.4\%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness.
Submission Number: 79
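
For illustration, here is a minimal sketch of what a structured, tag-based trace with a self-reported confidence (as described in the abstract) might look like, and how one could parse it downstream. The tag names (plan, think, answer, confidence) and the parsing helper are assumptions for this sketch, not ReFIne's actual output schema.

```python
import re

# Hypothetical tag-based reasoning trace; the tag names below are
# illustrative assumptions, not ReFIne's published format.
trace = """
<plan>1) Set up the equation. 2) Solve for x. 3) Verify.</plan>
<think>From 2x + 3 = 11 we get 2x = 8, so x = 4. Check: 2*4 + 3 = 11.</think>
<answer>4</answer>
<confidence>0.93</confidence>
"""

def extract(tag: str, text: str) -> str | None:
    """Return the contents of the first <tag>...</tag> block, if present."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

answer = extract("answer", trace)
confidence = float(extract("confidence", trace) or "nan")

# A reliable model's stated confidence should track empirical accuracy;
# e.g., one could bin answers by confidence and measure calibration.
print(answer, confidence)  # -> 4 0.93
```

Such a fixed, machine-checkable structure is what would let an evaluator score traces for interpretability and compare self-reported confidence against correctness, in line with the reliability property the paper targets.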