TIGER: Bridging the Multimodal Reasoning-Access Gap via Modality Counterfactuals

Published: 26 May 2026, Last Modified: 26 May 2026, ICML 2026 FoGen Workshop Poster, CC BY 4.0
Keywords: MLLM reasoning, generalisation
Abstract: While Multimodal Large Language Models (MLLMs) exhibit strong reasoning on text inputs, they often fail on semantically equivalent visual inputs. By rendering text problems as images, we isolate this failure and identify a reasoning-access gap: models correctly perceive visual content but fail to route that content into the latent reasoning mechanisms used for text-based tasks. To address this, we propose TIGER (Text-to-Image Gap-targeted Training for Enhanced Reasoning). TIGER automatically transforms text-only corpora into multimodal training data by mining modality counterfactuals (instances where a model succeeds on text but fails on the equivalent image), which provide targeted supervision without manually curated datasets. Implemented via image-conditioned Group Relative Policy Optimization (GRPO), TIGER consistently narrows the modality gap and improves visual reasoning on benchmarks such as MathVerse and EMMA. We further show that even RLVR-based models exhibit modality-dependent reasoning gaps, and that TIGER effectively reduces them. In addition, activation analyses reveal that TIGER helps visual representations better engage reasoning-relevant subspaces within the language backbone. Our results emphasize that robust multimodal reasoning requires more than better perception: visual inputs need reliable access to the model's existing reasoning machinery.
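The mining step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Problem`, `exact_match`, and `mine_modality_counterfactuals`, the callable interfaces for the model and the renderer, and the exact-match correctness check are all assumptions made for illustration. The sketch keeps only those problems that the model answers correctly from text but incorrectly from the rendered image.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Problem:
    """Hypothetical container for one text-only problem with a verifiable answer."""
    question: str
    answer: str


def exact_match(prediction: str, reference: str) -> bool:
    """Assumed correctness check: exact string match after trimming whitespace."""
    return prediction.strip() == reference.strip()


def mine_modality_counterfactuals(
    problems: Iterable[Problem],
    solve_text: Callable[[str], str],    # assumed: model answer given the text
    solve_image: Callable[[Any], str],   # assumed: model answer given the image
    render: Callable[[str], Any],        # assumed: renders the text as an image
    is_correct: Callable[[str, str], bool] = exact_match,
) -> List[Tuple[Any, str]]:
    """Return (image, answer) pairs where text succeeds but the image fails."""
    mined = []
    for problem in problems:
        image = render(problem.question)
        text_ok = is_correct(solve_text(problem.question), problem.answer)
        image_ok = is_correct(solve_image(image), problem.answer)
        # A modality counterfactual: same content, success on text only.
        if text_ok and not image_ok:
            mined.append((image, problem.answer))
    return mined
```

Under these assumptions, the returned pairs would serve as image-conditioned training prompts with verifiable answers, for example as inputs to a GRPO-style training loop. Problems the model fails in both modalities are excluded, since those failures are not attributable to the modality.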
Submission Number: 216