TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

ICLR 2026 Conference Submission 22031 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Model, Truthfulness, Reinforcement Learning
TL;DR: TruthRL incentivizes models not only to provide more correct responses but also to abstain from answering when uncertain, by enhancing their ability to recognize their knowledge boundaries, thereby improving truthfulness.
Abstract: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy: models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. This incentivizes models to reduce hallucinations not only by providing correct responses but also by abstaining when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% → 19.4%) and improves truthfulness (e.g., 5.3% → 37.2%), with consistent gains across various backbone models (e.g., Qwen, Llama). An in-depth ablation study demonstrates that vanilla accuracy-driven methods such as supervised fine-tuning or RL with a binary reward struggle to balance factual correctness and uncertainty, whereas the truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning-objective design for developing truthful LLMs. Moreover, we find that the improvement of TruthRL arises from enhancing the capability of LLMs to recognize their knowledge boundary, hence avoiding the over-conservatism exhibited by the baselines. Further analysis validates our method across multiple evaluation judges and confirms that TruthRL is robust to hallucination-baiting questions.
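To make the ternary reward concrete, here is a minimal Python sketch. The +1/0/-1 reward values, the abstention phrase list, and the string-match judge are illustrative assumptions, not the paper's reported implementation; in practice the correctness judgment and abstention detection would come from the benchmark's evaluation protocol or an LLM judge.

```python
# Sketch of a ternary truthfulness reward: correct answers are rewarded,
# abstentions are neutral, and confident-but-wrong answers (hallucinations)
# are penalized. Values and detectors below are hypothetical placeholders.

ABSTAIN_PHRASES = ("i don't know", "i am not sure", "i cannot answer")

def ternary_reward(response: str, gold_answer: str) -> float:
    """Map a model response to a ternary truthfulness reward."""
    text = response.strip().lower()
    if any(phrase in text for phrase in ABSTAIN_PHRASES):
        return 0.0   # abstention: neither rewarded nor penalized
    if gold_answer.strip().lower() in text:
        return 1.0   # correct answer
    return -1.0      # hallucination: confident but wrong

# Usage (rewards would feed a GRPO-style policy update, not shown here):
print(ternary_reward("The capital of France is Paris.", "Paris"))  # 1.0
print(ternary_reward("I don't know.", "Paris"))                     # 0.0
print(ternary_reward("The capital of France is Lyon.", "Paris"))    # -1.0
```

The key design point is the neutral abstention reward: unlike a binary correct/incorrect reward, it gives the policy a safe option when it is uncertain, so reducing hallucinations does not require guessing.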
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22031