Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multimodal LLM, Multimodal reasoning, Image perturbation
TL;DR: We show that MLLMs often fail to leverage visual inputs effectively. We propose a lightweight visual perturbation (VP) framework that boosts reasoning robustness across training methods and benchmarks, demonstrating that better reasoning begins with better seeing.
Abstract: Despite the rapid progress of multimodal large language models (MLLMs), the role of visual processing in multimodal reasoning remains underexplored. In a simple yet revealing experiment, we find that language-only models, when augmented with image captions, can sometimes outperform multimodal counterparts consuming raw visual inputs. This indicates that current MLLMs may perceive visual content but fail to effectively integrate it during reasoning. Moreover, even minimal visual perturbations such as small rotations lead to severe performance drops, exposing a fragility in their visual understanding. To address this overlooked bottleneck, we propose a lightweight visual perturbation (VP) framework that strengthens perceptual robustness without architectural changes or additional data. VP introduces three targeted perturbations: dominance-preserving mixup, random rotation, and distractor concatenation. These can be seamlessly integrated into post-training pipelines including SFT, DPO, and GRPO. Extensive experiments across four multimodal reasoning benchmarks show consistent absolute gains of 1–2 points, with improvements holding across datasets, training pipelines, and even advanced RL-tuned models. Ablation and task-level analyses further reveal how different perturbations uniquely benefit geometry, algebra, OCR, and chart reasoning. These findings underscore a central insight: better reasoning begins with better seeing.
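The abstract names three perturbation types but includes no code. As an illustration only, the following is a minimal Pillow sketch of perturbations of this kind; the function names, default parameters, and implementation details are assumptions for illustration, not the authors' actual VP implementation.

```python
import random
from PIL import Image

def dominance_preserving_mixup(img, other, alpha=0.2):
    """Blend a second image in at low weight so the original stays dominant."""
    other = other.convert("RGB").resize(img.size)
    return Image.blend(img.convert("RGB"), other, alpha)

def random_rotation(img, max_deg=15):
    """Apply a small random rotation, keeping the canvas size fixed."""
    angle = random.uniform(-max_deg, max_deg)
    return img.convert("RGB").rotate(angle, expand=False, fillcolor=(255, 255, 255))

def distractor_concat(img, distractor):
    """Place an unrelated distractor image side by side with the original."""
    img = img.convert("RGB")
    distractor = distractor.convert("RGB").resize(img.size)
    canvas = Image.new("RGB", (img.width + distractor.width, img.height), "white")
    canvas.paste(img, (0, 0))
    canvas.paste(distractor, (img.width, 0))
    return canvas
```

In a post-training pipeline, such transforms would be applied to training images on the fly (e.g. in a dataloader) so that the model must reason over perturbed visual inputs without any architectural change.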
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7565