VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training

ACL ARR 2026 January Submission3813 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Value Modeling, Semantic Awareness, Correcting Reward Bias, Robustness
Abstract: Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even cause advantage estimation to collapse. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To optimize more effectively under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and focus on key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.
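The sketch below is a minimal, hypothetical PyTorch illustration of how the two ingredients named in the abstract (auxiliary signals from a frozen language model's entropy and perplexity, and a variational information bottleneck on the value head) could be combined into a single value-model objective. It is an assumption about one plausible instantiation, not the authors' implementation; all names (BottleneckValueHead, vrpo_value_loss) and coefficients (beta, lam_ent, lam_ppl) are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckValueHead(nn.Module):
    """Value head with a variational information bottleneck over token features (assumed design)."""
    def __init__(self, hidden_dim: int, z_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, z_dim)
        self.to_logvar = nn.Linear(hidden_dim, z_dim)
        self.value = nn.Linear(z_dim, 1)

    def forward(self, h):                                   # h: (batch, seq, hidden)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)          # reparameterization
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)      # KL to N(0, I), per token
        return self.value(z).squeeze(-1), kl                # per-token values and KL terms

def vrpo_value_loss(values, returns, kl, frozen_logits, token_ids,
                    beta=1e-3, lam_ent=0.1, lam_ppl=0.1):
    """Value regression plus (assumed) entropy/perplexity-guided auxiliaries and an IB penalty."""
    # Standard value regression against (possibly noisy) returns.
    td_err = (values - returns) ** 2                        # (batch, seq)

    # Token-level entropy and negative log-likelihood from a frozen reference LM;
    # here they are used, by assumption, to re-weight the value regression toward
    # tokens the frozen model finds uncertain or surprising.
    probs = F.softmax(frozen_logits, dim=-1)                # frozen_logits: (batch, seq, vocab)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    nll = F.cross_entropy(frozen_logits.transpose(1, 2), token_ids, reduction="none")

    loss = (td_err
            + lam_ent * entropy.detach() * td_err
            + lam_ppl * nll.detach() * td_err
            + beta * kl).mean()
    return loss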
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Mathematical NLP, Scientific NLP, Conversational Modeling, Robustness
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English, Chinese
Submission Number: 3813