Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

Published: 08 Nov 2025, Last Modified: 08 Nov 2025 · ResponsibleFM @ NeurIPS 2025 · CC BY 4.0
Keywords: hallucination, reinforcement learning, retrieval-augmented
TL;DR: We propose reinforcement learning with a novel binary retrieval-augmented reward (RAR) to mitigate hallucinations.
Abstract: Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. This trustworthiness problem is particularly critical for deployment in high-stakes domains such as healthcare, education, and public policy. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their real-world applicability. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
Submission Number: 130
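To make the reward structure the abstract describes concrete, below is a minimal Python sketch of a binary retrieval-augmented reward contrasted with a continuous variant. This is an illustrative reading, not the paper's implementation: the helpers `extract_claims` (claim decomposition) and `is_supported` (retrieval-backed verification) are hypothetical stand-ins, and the treatment of claim-free abstentions such as "I don't know" as unpenalized is an assumption consistent with, but not specified by, the abstract.

```python
from typing import Callable


def binary_rar(
    response: str,
    extract_claims: Callable[[str], list[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Binary RAR: 1.0 only if every extracted claim is supported, else 0.0."""
    claims = extract_claims(response)
    if not claims:  # Assumption: a claim-free abstention is not penalized.
        return 1.0
    return 1.0 if all(is_supported(c) for c in claims) else 0.0


def continuous_rar(
    response: str,
    extract_claims: Callable[[str], list[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Continuous baseline: fraction of claims supported (partial credit)."""
    claims = extract_claims(response)
    if not claims:
        return 1.0
    return sum(is_supported(c) for c in claims) / len(claims)


# Toy usage with stub components (a real system would verify claims
# against retrieved documents):
demo_extract = lambda r: [s for s in r.split(". ") if s]
demo_verify = lambda c: "Paris" in c

print(binary_rar("Paris is the capital of France", demo_extract, demo_verify))        # 1.0
print(binary_rar("Paris is big. Lyon is the capital", demo_extract, demo_verify))     # 0.0
print(continuous_rar("Paris is big. Lyon is the capital", demo_extract, demo_verify)) # 0.5
```

The contrast illustrates why the binary signal could encourage calibrated abstention: with no partial credit, the only ways to raise expected reward on uncertain queries are to abstain or to state only claims the model can fully support, whereas the continuous reward still pays for mostly-correct outputs that contain some hallucinated claims.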