Can I Trust Your Visual Thinking? Measuring and Improving Visual Thinking Faithfulness

07 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multimodality, reasoning, faithfulness
Abstract: Recent large vision–language models (LVLMs) can generate vision–text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate yet still yields correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this to *the RL reward that only incentivizes the format of interleaved vision–text cues*, i.e., incorporating visual information into text reasoning steps. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened on or corrupted. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that **visual evidence is largely ignored**. To further diagnose the visual information, we introduce an automated LVLM-based evaluation pipeline that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. We then propose a novel MCoT learning strategy to address this issue, termed Sufficient-Component Cause Model (SCCM) learning, which encourages the MCoT to generate sufficient yet minimal visual components that can independently lead to correct answers. We note that the proposed SCCM is annotation-free and compatible with various RFT training pipelines for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves faithfulness metrics across a suite of fine-grained perception and reasoning benchmarks.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2778