From Facts to Conclusions: Integrating Deductive Reasoning in Retrieval-Augmented LLMs

Published: 08 Mar 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: retrieval-augmented generation, conflict-aware reasoning, trustworthy RAG systems, evidence adjudication, interpretable reasoning traces, grounded question answering, QLoRA, citation grounding, refusal and abstention, chain-of-thought, temporal and subjective conflicts, evaluation of grounding and correctness, LLM-as-a-judge evaluation, supervised fine-tuning, behavioural adherence
TL;DR: A structured reasoning-trace framework and evaluation pipeline that makes RAG robust to conflicting or outdated evidence by enabling grounded synthesis or justified refusal.
Abstract: Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work addresses these issues independently and lacks unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. We also introduce a Conflict-Aware Trust-Score (CATS) pipeline that evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning (SFT) improved end-to-end answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
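The three reasoning stages and the CATS aggregate described in the abstract can be sketched as data structures. This is a minimal, hypothetical illustration: all field names, verdict labels, and the unweighted-mean aggregation are assumptions, not the authors' actual schema or scoring rule.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional

# Hypothetical structures mirroring the paper's three reasoning stages;
# names and labels are illustrative, not taken from the paper.

@dataclass
class DocumentAdjudication:          # stage 1: per-document trust call
    doc_id: str
    verdict: str                     # e.g. "reliable", "outdated", "subjective"
    rationale: str

@dataclass
class ConflictAnalysis:              # stage 2: cross-document comparison
    conflicting_doc_ids: list
    conflict_type: str               # e.g. "temporal", "subjective", "factual"

@dataclass
class GroundedSynthesis:             # stage 3: answer or justified refusal
    answer: Optional[str]            # None signals a refusal
    citations: list = field(default_factory=list)
    refusal_reason: Optional[str] = None

def cats_score(groundedness: float, correctness: float,
               refusal_accuracy: float, behavior_alignment: float) -> float:
    """Illustrative CATS aggregate: an unweighted mean of the four
    sub-scores named in the abstract (the actual weighting is unknown)."""
    return mean([groundedness, correctness, refusal_accuracy, behavior_alignment])
```

A justified refusal would then be a `GroundedSynthesis(answer=None, refusal_reason=...)`, while a grounded answer carries its citation list alongside the text.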
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 86