VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Multimodal Models, Multimodal Reasoning, Benchmark
Abstract: Recent work has demonstrated that long-chain reasoning paradigms can enhance the ability of multimodal large language models (MLLMs) to solve complex problems. However, the precise reasons for the effectiveness of such paradigms remain unclear and difficult to probe. Specifically, it is challenging to quantify how much a model's extraction of visual cues and its reasoning during long-chain inference contribute to its performance improvements. Evaluating the faithfulness of MLLMs' reasoning to visual information is therefore crucial. To address this issue, we first present a cue-driven, automatic, instruction-following image editing pipeline built on GPT-Image-1. Building on it, we introduce VFaith-Bench, to our knowledge the first benchmark that evaluates MLLMs' visual faithfulness when generating long reasoning processes. Using this pipeline, we construct comparative question-answer pairs by editing the visual cues in an image that are crucial for solving the original reasoning problem, thereby changing the question's answer to another option. By testing similar questions on images that differ only in these details, the average accuracy reflects a model's visual reasoning ability, while the difference in accuracy before and after editing the test-set images reveals how faithfully the model's reasoning relies on visual cues. We further develop a multi-model filtering mechanism to detect erroneous edits and self-contradictions within images; combined with manual verification, this effectively eliminates image-quality degradation. We conduct in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series and reasoning models on VFaith-Bench, further investigating the underlying factors behind their reasoning capabilities. Our code and data will be open-sourced after the review period.
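As a rough illustration of the evaluation signal described in the abstract, the following is a minimal sketch (not the authors' released code) of how the average accuracy and the before/after-editing accuracy gap could be computed; the record structure and field names (`correct_original`, `correct_edited`) are assumptions for illustration only.

```python
# Minimal sketch of the faithfulness signal described in the abstract.
# Record structure and field names are hypothetical, not from the paper's release.

from dataclasses import dataclass
from typing import List


@dataclass
class PairedResult:
    correct_original: bool  # model answered correctly on the unedited image
    correct_edited: bool    # model answered correctly on the cue-edited image


def visual_reasoning_accuracy(results: List[PairedResult]) -> float:
    """Average accuracy over both image variants (overall visual reasoning ability)."""
    outcomes = [r.correct_original for r in results] + [r.correct_edited for r in results]
    return sum(outcomes) / len(outcomes)


def faithfulness_gap(results: List[PairedResult]) -> float:
    """Accuracy on original images minus accuracy on edited images.

    A large gap suggests the model answers from prior memory rather than
    from the visual cues actually present in the seen image.
    """
    acc_orig = sum(r.correct_original for r in results) / len(results)
    acc_edit = sum(r.correct_edited for r in results) / len(results)
    return acc_orig - acc_edit


if __name__ == "__main__":
    # Toy outcomes for three original/edited question pairs.
    toy = [PairedResult(True, False), PairedResult(True, True), PairedResult(False, False)]
    print(f"avg accuracy:     {visual_reasoning_accuracy(toy):.2f}")
    print(f"faithfulness gap: {faithfulness_gap(toy):.2f}")
```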
Primary Area: datasets and benchmarks
Submission Number: 11356