Keywords: LLM
Abstract: Reinforcement Learning (RL) has shown promise for large language models, but its direct application to multimodal LLMs (MLLMs) faces unique challenges. Unlike text-only LLMs, MLLMs must jointly optimize for visual grounding and language reasoning.
Our analysis reveals that RL primarily enhances textual reasoning, while progress on the crucial visual grounding component stalls, creating a bottleneck for overall model performance.
This observation highlights a critical mismatch: the learning challenge in MLLMs is concentrated in visually-grounded tokens, yet existing RL algorithms apply uniform optimization pressure across all tokens, thereby diluting the learning effort.
Motivated by this limitation, we propose Visually-grounded Credit Assignment (VICRA), a simple yet effective approach that reallocates optimization pressure toward visually-grounded tokens, explicitly correcting the token-level imbalance overlooked by prior methods.
Extensive experiments across benchmarks, base models, and training data show that VICRA consistently enhances multimodal reasoning, achieving significant gains over strong RL baselines. Our work establishes a general framework for more balanced and effective reinforcement learning in MLLMs.
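To make the core idea concrete, below is a minimal sketch of token-level credit reallocation in a clipped policy-gradient objective. The per-token `grounding_scores`, the `alpha` hyperparameter, and the specific reweighting formula are illustrative assumptions; the abstract does not specify VICRA's exact weighting scheme.

```python
import torch

def reweighted_policy_loss(logprobs, old_logprobs, advantages,
                           grounding_scores, alpha=1.0, clip_eps=0.2):
    """Clipped surrogate loss with credit shifted toward visually-grounded tokens.

    All tensors have shape (batch, seq_len). `grounding_scores` in [0, 1]
    is a hypothetical per-token measure of reliance on the image; it stands
    in for whatever grounding signal the actual method uses.
    """
    # Upweight visually-grounded tokens; alpha controls the reallocation strength.
    weights = 1.0 + alpha * grounding_scores
    # Renormalize so average optimization pressure per sequence is unchanged.
    weights = weights / weights.mean(dim=-1, keepdim=True)

    ratio = torch.exp(logprobs - old_logprobs)                    # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)           # standard clipped surrogate

    # Weighted mean concentrates the gradient on visually-grounded tokens.
    return (weights * per_token_loss).mean()
```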
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6882