Reinforcement Learning with Missing Context to Mitigate Reward Hacking from Training Only on Golden Answers
Keywords: reinforcement learning, benchmark, large language models, conversational llm, chatbots, synthetic data, reward hacking
Abstract: Reinforcement learning (RL) for reasoning has made remarkable progress in recent years. However, much of this progress has been demonstrated in overly idealized settings: in most existing benchmarks, problems are deterministic, carefully curated, and fully specified.
While such settings make evaluation straightforward, real-world reasoning tasks are often underspecified, lack crucial contextual information, or even contain misleading premises. We therefore argue that most current RL training paradigms based on verifiable rewards amount to an implicit form of reward hacking: because every training question is guaranteed to have a single golden answer, models are rewarded for always committing to one. Our experiments show that many state-of-the-art reasoning models tend to overcommit to producing a single definite answer, even when the problem is inherently underspecified. To address this gap, we propose \emph{Reinforcement Learning with Missing Context} (RLMC), a framework that explicitly trains models on problem instances with missing, underspecified, or incorrect context. We construct a large-scale RL dataset of 120K queries by intentionally synthesizing such imperfect questions, encouraging models to identify uncertainty, make reasonable assumptions, and reason effectively under incomplete information. Experimental results show that RLMC-trained models exhibit substantial gains in robustness, fewer hallucinations, and stronger overall reasoning compared to baselines trained only on fully specified tasks.
We further introduce \textsc{Reasoning Beyond the Given} (RBG), a benchmark designed to evaluate whether models can detect missing or inconsistent information and proactively elicit clarifying input from users. Evaluation on \textsc{RBG} exposes the limitations of current models in handling imperfect problem statements. Code, the \textsc{RBG} benchmark, and the training data will be fully released at \url{https://anonymous.4open.science/r/RLMC-RBG}.
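To make "intentionally synthesizing such imperfect questions" concrete, the sketch below shows one minimal, hypothetical construction; the paper's actual 120K-query pipeline is not described on this page. A fully specified seed problem is stored as a stem plus named fact clauses, and an underspecified variant is produced by deleting one required clause, with the gold target switched from the answer to a clarifying request. The Instance type, the clause representation, and the INSUFFICIENT convention are illustrative assumptions, not the authors' API.

# Minimal sketch (assumptions labeled above): synthesize an underspecified
# variant of a fully specified seed problem by dropping one required fact.
import random
from dataclasses import dataclass

@dataclass
class Instance:
    question: str
    target: str  # a golden answer, or the expected clarifying request

# A fully specified seed problem, with each required fact as a named clause.
SEED = {
    "stem": "A train makes a trip.",
    "facts": {
        "distance": "The trip covers 120 km.",
        "duration": "The trip takes 2 hours.",
    },
    "ask": "What is the train's average speed?",
    "answer": "60 km/h",
}

def fully_specified(seed) -> Instance:
    """Assemble the original, answerable problem."""
    body = " ".join([seed["stem"], *seed["facts"].values(), seed["ask"]])
    return Instance(question=body, target=seed["answer"])

def underspecified(seed, rng: random.Random) -> Instance:
    """Drop one required fact so no unique answer exists; the gold target
    then rewards asking for the missing fact instead of guessing."""
    missing = rng.choice(sorted(seed["facts"]))
    kept = [v for k, v in seed["facts"].items() if k != missing]
    body = " ".join([seed["stem"], *kept, seed["ask"]])
    target = f"INSUFFICIENT: ask the user for the {missing} before answering."
    return Instance(question=body, target=target)

rng = random.Random(0)
for inst in [fully_specified(SEED), underspecified(SEED, rng)]:
    print(inst.question)
    print("  gold:", inst.target)

Under a scheme like this, a verifiable reward can credit a clarifying question on imperfect items exactly as it credits the golden answer on complete ones, removing the incentive to always commit to a single answer.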
Primary Area: reinforcement learning
Submission Number: 11678