Multi-Grained Policy Optimization for Multimodal Reasoning: From An Uncertainty Perspective

16 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Importance Sampling Weights, Uncertainty Estimation, Reinforcement Learning, Multimodal Reasoning
Abstract: Reinforcement learning (RL) techniques such as Group Relative Policy Optimization (GRPO) have substantially advanced the reasoning capabilities of Large Language Models (LLMs) and Multimodal LLMs (MLLMs). However, subsequent studies have revealed two key limitations of GRPO: training instability and insufficient token exploration in the optimization objective. To address these issues, Group Sequence Policy Optimization (GSPO) introduces sequence-level importance sampling weights to mitigate training instability, while uncertainty-driven approaches emphasize low-probability tokens to encourage exploration. Yet these methods pay little attention to balancing training stability against token exploration. In this paper, we propose Multi-Grained Policy Optimization (MGPO), a simple yet effective algorithm that introduces multi-grained importance sampling weights for enhanced reasoning. We first examine the effect of diverse importance sampling weights and identify their influence on training stability and token exploration during RL training. Building on this analysis, we dynamically adjust the ratio between token-level and sequence-level importance sampling weights via uncertainty estimation on log probabilities, thereby balancing training stability and token exploration effectively. Extensive experiments demonstrate that MGPO consistently outperforms GRPO, GSPO, and multiple open-source and R1-style 3B/7B models across widely adopted multimodal reasoning benchmarks, requiring only a few lines of code modification, highlighting its effectiveness and generalizability.
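The abstract describes blending token-level (GRPO-style) and sequence-level (GSPO-style) importance sampling weights, with the mixing ratio driven by an uncertainty estimate on log probabilities. The paper's exact blending rule is not given here, so the following is only a minimal sketch under assumptions: uncertainty is taken as the negative log-probability of each token under the current policy, and the sequence-level ratio is the geometric mean of token ratios.

```python
import numpy as np

def multi_grained_weights(logp_new, logp_old, temperature=1.0):
    """Hypothetical sketch of multi-grained importance sampling weights.

    Blends per-token ratios with a sequence-level ratio, with the blend
    controlled by a simple per-token uncertainty proxy (NOT the paper's
    exact formulation, which is not specified in this abstract).
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)

    # Token-level ratios (GRPO-style): fine-grained, supports exploration.
    token_ratio = np.exp(logp_new - logp_old)

    # Sequence-level ratio (GSPO-style): geometric mean of token ratios,
    # a single stable weight shared across the sequence.
    seq_ratio = np.exp(np.mean(logp_new - logp_old))

    # Uncertainty proxy: larger for low-probability tokens under the
    # current policy (logp_new is non-positive, so -logp_new >= 0).
    uncertainty = -logp_new / temperature
    alpha = uncertainty / (1.0 + uncertainty)  # in [0, 1)

    # High-uncertainty tokens keep token-level weights (exploration);
    # confident tokens fall back to the sequence-level weight (stability).
    return alpha * token_ratio + (1.0 - alpha) * seq_ratio
```

When the policy has not moved (`logp_new == logp_old`), both granularities give ratio 1 and the blended weight is 1 for every token, matching the usual importance-sampling sanity check.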
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6914