GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking

ICLR 2026 Conference Submission 17894 Authors

19 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: General Multimodal Reasoning, Visual Reasoning, Multimodal Large Language Models
Abstract: Despite recent advances in multimodal reasoning, Multimodal Large Language Models (MLLMs) still underperform on complex vision-centric reasoning tasks relative to their strong language-based reasoning. This performance gap stems from a critical asymmetry in their reasoning processes: while MLLMs excel at iterative reflection and correction in textual contexts, they tend to accept their initial visual interpretations uncritically and rarely revise them, even when these cues lead to logical inconsistencies. To overcome this shortcoming, we introduce \textbf{GThinker}, a general-purpose reasoning MLLM that unifies robust textual reasoning with a novel, adaptive visual rethinking capability. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds reasoning in visual cues and strategically triggers rethinking of those cues to resolve visual inconsistencies and keep the reasoning sound. To cultivate this adaptive capability across domains, we design a two-stage training pipeline: a pattern-guided cold start with judge-guided selective training, followed by incentivizing reinforcement learning. To support this training, we construct GThinker-11k, a dataset of 7K cue-annotated chain-of-thought examples and 4K diverse reinforcement-learning samples, built with our iterative multimodal annotation pipeline. Extensive experiments demonstrate that GThinker achieves 81.5\% on the challenging comprehensive multimodal reasoning benchmark M$^3$CoT, surpassing the latest o4-mini model. It also shows an average improvement of 2.1\% on general-scenario multimodal reasoning benchmarks, while maintaining on-par performance in mathematical reasoning compared to advanced reasoning counterparts.
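The Cue-Rethinking pattern named in the abstract can be pictured as a loop that reasons over visual cues and selectively revises the ones that conflict with the emerging conclusion. The sketch below is a minimal conceptual illustration of that loop, not GThinker's implementation: every name in it (`VisualCue`, `is_consistent`, `rethink_cue`, `cue_guided_rethinking`) is hypothetical, and the consistency check and revision step are stand-ins for model calls the abstract describes only at a high level.

```python
# Illustrative sketch only: all names here are hypothetical and the heuristics
# stand in for MLLM calls; this is not GThinker's actual API or method.

from dataclasses import dataclass


@dataclass
class VisualCue:
    description: str
    confidence: float


def is_consistent(cue: VisualCue, conclusion: str) -> bool:
    """Placeholder consistency check: treat low-confidence cues as inconsistent."""
    return cue.confidence >= 0.5


def rethink_cue(cue: VisualCue) -> VisualCue:
    """Placeholder revision: a real model would re-examine the image region."""
    return VisualCue(description=cue.description + " (revised)", confidence=0.9)


def cue_guided_rethinking(cues: list[VisualCue], max_rounds: int = 3) -> list[VisualCue]:
    """Reason over cues, then selectively rethink the inconsistent ones."""
    for _ in range(max_rounds):
        # Stand-in for a reasoning step that derives a conclusion from the cues.
        conclusion = "; ".join(c.description for c in cues)
        inconsistent = [c for c in cues if not is_consistent(c, conclusion)]
        if not inconsistent:
            break  # all cues support the conclusion; no rethinking needed
        cues = [rethink_cue(c) if c in inconsistent else c for c in cues]
    return cues


if __name__ == "__main__":
    cues = [
        VisualCue("a red octagonal sign", confidence=0.9),
        VisualCue("text on the sign reads 'SPOT'", confidence=0.3),
    ]
    print(cue_guided_rethinking(cues))
```

The key design point the abstract emphasizes is the adaptivity of this loop: rethinking is triggered only when a cue conflicts with the reasoning, rather than re-verifying every cue on every step.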
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17894