Keywords: Multimodal Large Language Models (MLLMs); Visual Reasoning; Chain-of-Thought (CoT); Causal Analysis; Visual Grounding; Interpretability
TL;DR: We introduce Contrastive Region Masking (CRM), a training-free diagnostic that causally tests how multimodal LLMs depend on visual regions.
Abstract: We present Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM delivers causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM exposes distinct failure modes: some models preserve reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. By shifting evaluation from answer correctness to reasoning faithfulness, CRM reframes visual benchmarks as diagnostic tools, highlighting the need for multimodal evaluation frameworks that measure not just performance, but also robustness and fidelity of reasoning.
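To make the masking-and-contrast procedure concrete, here is a minimal sketch of the idea described in the abstract. It is not the authors' implementation: the gray occluder, the bounding-box region format, the `query_mllm` caller-supplied function, and the exact-string step comparison are all illustrative assumptions.

```python
from PIL import Image, ImageDraw


def mask_region(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Return a copy of `image` with the annotated bounding box painted out."""
    masked = image.copy()
    # Assumption: a neutral gray occluder stands in for the removed evidence.
    ImageDraw.Draw(masked).rectangle(box, fill=(127, 127, 127))
    return masked


def contrastive_region_masking(image, box, question, query_mllm):
    """Contrast step-by-step reasoning traces with and without the annotated region.

    `query_mllm(image, prompt)` is a hypothetical caller-supplied function that
    sends an image plus a CoT prompt to an MLLM and returns the reasoning trace
    as a list of step strings.
    """
    prompt = f"{question}\nLet's think step by step."
    baseline_steps = query_mllm(image, prompt)            # unmasked baseline trace
    masked_steps = query_mllm(mask_region(image, box), prompt)  # trace with region hidden

    # Step-level contrast: flag steps whose content changes once the region is masked,
    # a crude proxy for that step's causal dependence on the region.
    diagnosis = []
    for i, (base, masked) in enumerate(zip(baseline_steps, masked_steps)):
        diagnosis.append({
            "step": i,
            "baseline": base,
            "masked": masked,
            "changed": base.strip() != masked.strip(),
        })
    return diagnosis
```

In practice the step comparison would likely use something softer than string equality (e.g. semantic similarity), but the structure above captures the contrast between masked and unmasked reasoning traces.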
Submission Number: 201