Beyond Correctness: A Framework for Analyzing Reasoning and Faithfulness of Multi-hop Question Answering

ACL ARR 2026 January Submission5186 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multi-hop Question Answering, Retrieval-Augmented Generation, Reasoning Evaluation, Contextual Faithfulness, Preference Optimization
Abstract: Diagnosing the root causes of failures in Retrieval-Augmented Generation (RAG) is challenging, as current evaluations often conflate logical reasoning errors with failures to adhere to retrieved context. To address this, we propose *ReFa*, a framework inspired by Structural Causal Models that explicitly disentangles *Reasoning* from *Faithfulness*. Through this lens, we characterize two distinct failure modes: the *'Fool'*, a deficiency in logical reasoning, and the *'Lazy'*, a lapse in faithfulness where the model defaults to its parametric priors. To operationalize this diagnosis, we introduce Dual Reasoning Chain Editing, a mechanism that constructs controlled proxy chains to isolate reasoning structure from evidence faithfulness. We apply this method on *ReFaBench*, a new benchmark constructed in this work that features factual, counterfactual, and knowledge-conflicting scenarios. We further propose *ReFa-DPO*, a decoupled preference optimization strategy that leverages these proxies to target specific failure patterns. Experimental results demonstrate that *ReFa-DPO* improves robustness, particularly in mitigating parametric interference, by jointly strengthening both contextual faithfulness and reasoning capability.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: multihop QA, interpretability, question generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 5186