Keywords: RL, Risk-based optimization, LLM post-training
Abstract: Reinforcement Learning with Verifiable Rewards has become a central paradigm for post-training Large Language Models (LLMs). Group Relative Policy Optimization (GRPO), which optimizes a mean-based objective, suffers from limited exploration and yields only modest reasoning gains. We propose Risk-based Policy Optimization (RiskPO), which leverages risk measures from operations research to address these issues. In particular, we introduce a Mixed Value-at-Risk objective and adopt a bundle-wise training scheme that groups multiple questions into a bundle to provide stable and informative training signals. Numerical results show that RiskPO consistently outperforms GRPO and its variants across multiple mathematical reasoning benchmarks, achieving substantial improvements on both Pass@1 and Pass@k metrics. These results highlight the effectiveness of risk-based optimization in enhancing exploration and expanding the reasoning capabilities of LLMs.
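The abstract does not spell out the Mixed Value-at-Risk objective or the bundling scheme; the sketch below is only a rough, hypothetical illustration of how a mixed VaR-style score might be computed over the pooled rewards of a question bundle. The quantile levels, mixing weights, tail-averaging choice, and all names are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def mixed_var_objective(bundle_rewards, alpha=0.1, beta=0.9, w_low=0.5, w_high=0.5):
    """Illustrative mixed Value-at-Risk style score over a bundle of rewards.

    bundle_rewards: 1-D array of verifiable rewards pooled over all rollouts
    of the questions in one bundle. alpha/beta are lower/upper quantile levels
    and w_low/w_high are mixing weights (all hypothetical choices).
    """
    r = np.asarray(bundle_rewards, dtype=float)
    var_low = np.quantile(r, alpha)     # lower-tail Value-at-Risk
    var_high = np.quantile(r, beta)     # upper-tail Value-at-Risk
    # Average the rewards at or below / above the respective quantiles
    # (a CVaR-style tail mean, used here only as a plausible stand-in).
    tail_low = r[r <= var_low].mean()
    tail_high = r[r >= var_high].mean()
    return w_low * tail_low + w_high * tail_high

# Toy usage: three questions bundled together, four rollouts each,
# with 0/1 verifiable rewards.
rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=(3, 4)).astype(float)
score = mixed_var_objective(rewards.ravel())
print(f"bundle-level mixed-VaR score: {score:.3f}")
```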
Submission Number: 101