Generative Adversarial Optimization: Dual-Reward Reinforcement Learning for Mathematics Reasoning

ICLR 2026 Conference Submission 18892 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Reasoning Model, Math
TL;DR: A game-theoretic training approach empowers the model to achieve state-of-the-art performance in mathematical problem-solving.
Abstract: Despite recent progress in large language models (LLMs), their mathematical reasoning abilities remain largely dependent on fine-tuning on annotated data and generalize poorly to out-of-distribution tasks. To address this, current methods adopt reinforcement learning (RL) to incentivize the latent capabilities of LLMs, mitigating the need for annotations. However, they often suffer from uncontrollable data difficulty and limited initial capabilities. In this paper, we propose Generative Adversarial Optimization (GAO), a novel reinforcement learning framework consisting of a problem poser and a problem solver that are optimized iteratively with dual rewards. Specifically, the poser attempts to propose challenging problems that stump the solver, while the solver strives to solve them. The complete adversarial process is recorded to generate bidirectional rewards, enabling the poser and solver to co-evolve through this competitive interaction. Experimental results show that GAO achieves state-of-the-art performance compared to previous models of the same size, even without relying on proprietary LLMs.
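The abstract does not spell out the reward formulation, so the following is only a minimal sketch of how such a poser–solver round might be wired up. All names (`adversarial_round`, `poser_reward`, `verifier`, `n_rollouts`) and the pass-rate-based dual reward shaping are illustrative assumptions, not the paper's actual method:

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of one GAO-style adversarial round. The reward
# shaping below is an assumption for illustration, not the authors'
# published objective.

@dataclass
class Episode:
    problem: str
    attempts: list        # solver rollouts for this problem
    pass_rate: float      # fraction of attempts judged correct

def poser_reward(pass_rate: float) -> float:
    # Assumed shaping: the poser is rewarded for problems the solver finds
    # hard but not impossible, so trivially easy (pass_rate ~ 1) and
    # unsolvable (pass_rate ~ 0) problems both earn little.
    return 1.0 - abs(2.0 * pass_rate - 1.0)

def solver_reward(correct: bool) -> float:
    # Assumed shaping: plain per-rollout correctness reward.
    return 1.0 if correct else 0.0

def adversarial_round(poser, solver, verifier, n_rollouts: int = 8) -> Episode:
    """Run one poser-vs-solver round and record it for policy updates."""
    problem = poser()                                 # poser proposes a problem
    attempts = [solver(problem) for _ in range(n_rollouts)]
    outcomes = [verifier(problem, a) for a in attempts]
    pass_rate = sum(outcomes) / n_rollouts
    # Both policies would then be updated from this recorded episode
    # (e.g. with a policy-gradient method); the update step is omitted here.
    return Episode(problem, attempts, pass_rate)

# Stand-in stubs so the sketch runs end to end; real LLMs replace these.
poser = lambda: "Find all integers n such that n^2 + n + 41 is prime."
solver = lambda p: f"candidate solution to: {p}"
verifier = lambda p, a: random.random() < 0.5        # stub grader

ep = adversarial_round(poser, solver, verifier)
print(f"pass_rate={ep.pass_rate:.2f}, poser_reward={poser_reward(ep.pass_rate):.2f}")
```

The tent-shaped poser reward is one plausible way to make the two objectives adversarial yet keep problem difficulty controllable, which is the failure mode of prior RL methods the abstract calls out.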
Primary Area: reinforcement learning
Submission Number: 18892