Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Published: 02 Mar 2026, Last Modified: 29 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: LLM Hallucination, RLVR, Calibration
TL;DR: We introduce a behaviorally calibrated RL reward that trains LLMs to report calibrated confidence and abstain under user-defined risk preference, yielding large gains in hallucination mitigation against GPT-5.
Abstract: The deployment of Large Language Models (LLMs) in critical domains is currently impeded by the persistent phenomenon of hallucination: the generation of plausible but factually incorrect assertions. Standard Reinforcement Learning with Verifiable Rewards (RLVR) paradigms, which predominantly rely on binary reward signals, inadvertently incentivize models to behave as "good test-takers" rather than "honest communicators". In this paper, we introduce an alternative reward for behavioral calibration, which trains a model via reinforcement learning to output calibrated probabilities of correctness and to abstain when those probabilities fall below a user-specified risk threshold. The model can either abstain from producing a complete response or flag individual claims whose uncertainty remains high. Our approach allows a 4B-parameter model to surpass frontier models in hallucination mitigation, a capability we demonstrate to be a transferable meta-skill that can be decoupled from raw predictive accuracy. When trained on mathematical reasoning tasks, our model achieves a log-scale gain of **0.806** in the Accuracy-to-Hallucination Ratio by rejecting uncertain responses, substantially exceeding GPT-5 (**0.207**) on the challenging BeyondAIME benchmark. When applied at the claim level, our approach further surpasses Gemini-2.5-Pro on the same metric. Moreover, the hallucination-mitigation capability generalizes to cross-domain factual QA.
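The abstract does not give the reward's exact functional form, but a standard way to make abstention optimal exactly when the model's probability of correctness falls below a threshold t is to score a correct answer +1, abstention 0, and a wrong answer -t/(1-t). The sketch below illustrates this incentive structure; the function name and signature are illustrative, not the paper's implementation.

```python
def behavioral_reward(answered: bool, correct: bool, threshold: float) -> float:
    """Sketch of a threshold-calibrated abstention reward (illustrative).

    With this scoring, the expected reward for answering with probability
    p of being correct is p - (1 - p) * threshold / (1 - threshold),
    which is positive exactly when p > threshold. A rational policy
    therefore abstains whenever its confidence is below the
    user-specified risk threshold.
    """
    if not answered:
        return 0.0          # abstaining is always worth exactly zero
    if correct:
        return 1.0          # correct answer earns full reward
    return -threshold / (1.0 - threshold)  # wrong answer is penalized


def expected_reward(p_correct: float, threshold: float) -> float:
    """Expected reward of answering, given confidence p_correct."""
    return (p_correct * behavioral_reward(True, True, threshold)
            + (1.0 - p_correct) * behavioral_reward(True, False, threshold))
```

At the break-even point p_correct == threshold the expected reward of answering is exactly 0, matching the value of abstention; above the threshold answering dominates, below it abstaining does.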
Submission Number: 16