Keywords: LLM Reasoning, Reinforcement Learning, Unsupervised Learning
TL;DR: We make an early attempt at fully unsupervised LLM reasoning incentivization.
Abstract: Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervision, such as labeled reasoning traces, verified gold answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), an early attempt at fully unsupervised LLM reasoning incentivization. By minimizing the semantic entropy of LLMs on unlabeled questions, EMPO achieves competitive performance compared to supervised counterparts. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 33.7\% to 51.6\% on math benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1\% to 50.1\% on MMLU-Pro. A preliminary analysis is also provided to interpret the effectiveness of EMPO. Code is available at https://github.com/QingyangZhang/EMPO.
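As a rough illustration of the idea described in the abstract (not the authors' implementation; see the repository for the actual method), the sketch below assumes semantic clusters are formed by exact-match on extracted final answers and assigns each sampled completion a reward equal to its cluster's empirical probability, so that maximizing expected reward corresponds to lowering the semantic entropy over answers to an unlabeled question.

```python
from collections import Counter
import math

def semantic_entropy_reward(final_answers):
    """Given the final answers extracted from several sampled completions to one
    unlabeled question, cluster them by exact string match (a simplification of
    semantic-equivalence clustering) and return:
      - a per-sample reward equal to the empirical probability of its cluster,
      - the semantic entropy of the resulting cluster distribution.
    Higher-probability (majority) clusters receive larger rewards, so optimizing
    this reward pushes the model toward lower semantic entropy."""
    counts = Counter(final_answers)
    n = len(final_answers)
    cluster_prob = {ans: c / n for ans, c in counts.items()}
    entropy = -sum(p * math.log(p) for p in cluster_prob.values())
    rewards = [cluster_prob[ans] for ans in final_answers]
    return rewards, entropy

# Hypothetical example: 8 sampled completions to one question, reduced to final answers.
samples = ["42", "42", "41", "42", "42", "40", "42", "42"]
rewards, H = semantic_entropy_reward(samples)
print(rewards)  # majority-cluster samples get the largest reward
print(H)        # semantic entropy of the sampled answer distribution
```

These per-sample rewards could then be fed to any label-free policy-optimization loop in place of a verifier or reward-model signal; the choice of clustering rule (exact match here) is the main assumption in this sketch.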
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 2554