Mitigating Hallucinations in Large Language Models via Hybrid Reinforcement Learning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Hallucination Mitigation, Reinforcement Learning from Human Feedback, Reinforcement Learning from AI Feedback, Hybrid Reinforcement Learning, Natural Language Processing, Factual Accuracy
TL;DR: We propose a Hybrid Reinforcement Learning framework that dynamically combines human and AI feedback to significantly reduce hallucinations in large language models while maintaining text quality and scalability.
Abstract: Large Language Models (LLMs) demonstrate remarkable text generation capabilities but remain prone to hallucinations---fluent outputs containing factual errors or unverifiable claims. We introduce Hybrid Reinforcement Learning (HRL), a framework that dynamically integrates Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) through a learnable, context-dependent weighting mechanism. Our approach computes an adaptive mixing parameter $\alpha(c,t)$ from 16-dimensional features capturing question complexity, model uncertainty, and training progress. Validated on TruthfulQA, HaluEval, and the Anthropic HH-RLHF dataset with real human preference annotations, HRL achieves a 5\% accuracy improvement and a 35\% reduction in hallucinations over static baselines, demonstrating effective integration of human judgment with scalable AI feedback.
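A minimal sketch of how the context-dependent weighting described in the abstract could be realized, assuming a small learned gating network over the 16-dimensional feature vector and a convex combination of human and AI reward signals; all class and function names (`AlphaGate`, `hybrid_reward`) and hyperparameters are illustrative, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class AlphaGate(nn.Module):
    """Learnable context-dependent mixing weight alpha(c, t) in (0, 1)."""
    def __init__(self, feature_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # squash output to (0, 1)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, 16) -- e.g. question complexity, model
        # uncertainty, and training-progress signals
        return self.net(features).squeeze(-1)


def hybrid_reward(r_human: torch.Tensor,
                  r_ai: torch.Tensor,
                  alpha: torch.Tensor) -> torch.Tensor:
    """Convex combination of human-feedback and AI-feedback rewards."""
    return alpha * r_human + (1.0 - alpha) * r_ai


# Illustrative usage with random inputs:
gate = AlphaGate()
feats = torch.randn(4, 16)           # 16-dim context features per example
alpha = gate(feats)                   # adaptive mixing weight alpha(c, t)
reward = hybrid_reward(torch.rand(4), torch.rand(4), alpha)
```

The gate can be trained jointly with the policy so that, for instance, harder or higher-uncertainty questions lean more on human feedback while routine cases rely on cheaper AI feedback.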
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 14138