VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning

ICLR 2026 Conference Submission 14431 Authors

18 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Multimodal Reasoning, RLVR, Visual Uncertainty-Guided Exploration
TL;DR: We introduce VOGUE (Visual-Uncertainty-Guided Exploration), a method that enhances exploration by shifting it from the output (text) space to the input (visual) space, treating the image as a stochastic context.
Abstract: Reinforcement learning with verifiable rewards (RLVR) improves reasoning in LLMs but struggles with exploration, an issue that persists in Multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity and failing to build policies that are robust to plausible visual variations. We introduce $\textbf{VOGUE (Visual-Uncertainty-Guided Exploration)}$, a novel method that shifts exploration from the output (text) space to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy's sensitivity to visual perturbations using the symmetric KL divergence between a "raw" and a "noisy" branch, yielding a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6\% on three visual math benchmarks and 3.7\% on three general-domain reasoning benchmarks, while also increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
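To make the mechanism concrete, the sketch below shows one way the abstract's uncertainty signal could be computed: a per-token symmetric KL divergence between the next-token distributions of the raw-image and noisy-image branches, combined with a token-entropy term into a shaping bonus. This is a minimal illustration under assumptions, not the authors' implementation; the function names, the coefficients `lambda_u` and `lambda_h`, and the PyTorch framing are all hypothetical.

```python
import torch
import torch.nn.functional as F


def symmetric_kl(logits_raw: torch.Tensor, logits_noisy: torch.Tensor) -> torch.Tensor:
    """Per-token symmetric KL, 0.5 * (KL(p||q) + KL(q||p)), between the
    next-token distributions of the raw and noisy visual branches."""
    log_p = F.log_softmax(logits_raw, dim=-1)
    log_q = F.log_softmax(logits_noisy, dim=-1)
    kl_pq = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # KL(p || q)
    kl_qp = (log_q.exp() * (log_q - log_p)).sum(dim=-1)  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)  # shape: [batch, seq_len]


def vogue_style_bonus(logits_raw, logits_noisy, lambda_u=0.1, lambda_h=0.01):
    """Hypothetical shaping term: a visual-uncertainty bonus (symmetric KL)
    plus a token-entropy bonus, each averaged over response tokens.
    lambda_u and lambda_h are illustrative placeholder coefficients."""
    uncertainty = symmetric_kl(logits_raw, logits_noisy)  # [batch, seq_len]
    log_p = F.log_softmax(logits_raw, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)          # [batch, seq_len]
    return lambda_u * uncertainty.mean(dim=-1) + lambda_h * entropy.mean(dim=-1)
```

In a GRPO-style loop, such a scalar would presumably be added to each sampled response's verifiable reward before group-normalized advantages are computed, with the annealed sampling schedule from the abstract decaying the visual perturbation (and hence the bonus) over training.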
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14431