Keywords: LLM Reasoning; Unsupervised Learning
Abstract: While large language models (LLMs) have demonstrated exceptional capabilities on challenging tasks such as mathematical reasoning, existing methods for enhancing reasoning ability predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. However, these approaches critically depend on external supervision, such as labeled reasoning traces, verified gold answers, or pre-trained reward models, which limits their scalability and practical applicability. In this work, we propose Entropy Minimized Policy Optimization (EMPO), an early attempt at fully unsupervised incentivization of LLM reasoning. EMPO requires no supervised information to incentivize reasoning capabilities. By continuously minimizing the predictive entropy of LLMs on unlabeled user queries in a latent semantic space, EMPO enables purely self-supervised evolution of reasoning capabilities with strong flexibility and practicality. Our experiments demonstrate competitive performance of EMPO on both mathematical reasoning and free-form natural reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7\% to 48.1\% on mathematical benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1\% to 50.1\% on MMLU-Pro.
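To make the entropy-minimization idea concrete, here is a minimal Python sketch, not the authors' implementation: it takes several sampled answers to one unlabeled query, groups them into semantic clusters, measures the predictive entropy over those clusters, and scores each sample by the empirical probability of its cluster, so that reinforcing high-probability clusters lowers entropy. The clustering rule used here (matching canonicalized final answers, as one might do for math problems) and the reward shape are illustrative assumptions.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Empirical entropy over semantic clusters of sampled answers.

    Each element of cluster_ids is assumed to already identify the
    semantic cluster of one sampled answer (e.g., its canonicalized
    final answer). The clustering rule itself is an assumption of
    this sketch, not a detail specified by the abstract.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def per_sample_rewards(cluster_ids):
    """Reward each sample by the empirical probability of its cluster.

    Policy optimization that reinforces high-reward samples then
    concentrates probability mass on dominant clusters, which is one
    way to drive down the predictive entropy computed above.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return [counts[cid] / n for cid in cluster_ids]

# Example: 8 sampled answers to one unlabeled query, clustered by
# their canonicalized final answer.
samples = ["42", "42", "41", "42", "42", "7", "42", "41"]
print(f"semantic entropy: {semantic_entropy(samples):.3f}")
print("per-sample rewards:", per_sample_rewards(samples))
```

In this toy run, the majority cluster ("42") receives the highest reward, so an RL update favoring those samples would sharpen the model's answer distribution on this query without any labels or external reward model.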
Submission Number: 4