Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 spotlight | CC BY 4.0
TL;DR: We target the utility degradation issue that prior hallucination-reduction methods often struggle to avoid, and propose online RL with Binary Retrieval-Augmented Reward to reduce hallucinations while preserving general capabilities.
Abstract: Modern post-trained language models are increasingly capable, but remain prone to extrinsic hallucinations. We target the utility degradation issue that prior hallucination-reduction methods often struggle to avoid, and propose online RL with Binary Retrieval-Augmented Reward (Binary RAR) to reduce hallucinations while preserving general capabilities. Binary RAR assigns a reward of 1 if a response contains no factual contradictions with retrieved evidence, and 0 otherwise. We theoretically show that this method reduces the probability of error-containing responses while preserving the distribution of error-free responses. This helps preserve the model’s capabilities, whereas other methods often degrade them. We evaluate Binary RAR on multiple widely used models. On Qwen3-8B, it reduces long-form hallucination rates by 39.3% and short-form hallucination rates by 54.4%, outperforming supervised learning and preference optimization baselines. Our error analysis shows that continuous factuality rewards (e.g., VeriScore) can be exploited via reward hacking by producing fewer or more generic claims, whereas Binary RAR is more robust and better preserves general capabilities, including instruction following, math, and coding.
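The reward rule and the theoretical claim in the abstract can be illustrated with a small sketch. Everything here beyond the 1/0 rule is an assumption made for illustration: `toy_contradicts` is a hypothetical stand-in for a real contradiction verifier, and `kl_tilted_policy` uses the standard closed-form optimum of a KL-regularized RL objective, which the abstract does not name as the paper's training objective.

```python
import math
from typing import Callable, Dict, Sequence


def binary_rar(response: str, evidence: Sequence[str],
               contradicts: Callable[[str, str], bool]) -> int:
    """Return 1 if no retrieved passage contradicts the response, else 0."""
    return 0 if any(contradicts(response, doc) for doc in evidence) else 1


def toy_contradicts(response: str, doc: str) -> bool:
    """Hypothetical checker: 'key: value' strings clash if keys match but values differ."""
    r_key, _, r_val = response.partition(":")
    d_key, _, d_val = doc.partition(":")
    return r_key.strip() == d_key.strip() and r_val.strip() != d_val.strip()


def kl_tilted_policy(ref_probs: Dict[str, float], rewards: Dict[str, int],
                     beta: float) -> Dict[str, float]:
    """Assumed objective: maximize E[r] - beta * KL(pi || pi_ref).

    Its optimum is pi(y) proportional to pi_ref(y) * exp(r(y) / beta).
    """
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in ref_probs.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


# Toy reference policy over three candidate responses (made-up numbers).
evidence = ["capital of France: Paris"]
ref = {
    "capital of France: Paris": 0.2,
    "capital of France: Lyon": 0.4,   # contradicted by the evidence
    "capital of Peru: Lima": 0.4,     # not contradicted, so reward is 1
}
rewards = {y: binary_rar(y, evidence, toy_contradicts) for y in ref}
new = kl_tilted_policy(ref, rewards, beta=0.5)

for y in ref:
    print(f"{y!r}: reward={rewards[y]}, ref={ref[y]:.3f}, new={new[y]:.3f}")
```

Under these assumptions, the contradicted response loses probability mass, while the two uncontradicted responses keep the same ratio to each other that they had under the reference policy, since every reward-1 response is scaled by the same factor. Note that a response unrelated to the evidence also receives reward 1, because the rule checks for contradiction rather than support.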
Lay Summary: AI assistants frequently generate plausible but factually wrong information, a problem called hallucination. Prior fixes reduce hallucinations but often make the model less useful at other tasks. We propose Binary RAR: during training, a model earns a reward of 1 only if its entire response contains no contradictions with retrieved web documents, and 0 otherwise. This all-or-nothing signal is harder to game than scoring responses claim by claim: models can't cheat by giving vaguer or shorter answers. The result is a model that hallucinates up to 54% less while remaining equally capable at coding, math, and following user instructions.
Primary Area: Deep Learning->Large Language Models
Keywords: hallucination, factuality, reinforcement learning, retrieval-augmented generation
Submission Number: 22392