ICoM: Interleaved CoT with Adaptive Visual Focusing and Layer-specific Merging for Advanced Mathematical Reasoning
Keywords: layer-specific, merging, CoT, visual focusing, reasoning
Abstract: Vision Large Language Models (VLLMs) have achieved remarkable progress in multimodal reasoning. However, they often generate text-only reasoning steps based on internal priors, making it difficult to dynamically focus on critical visual regions. Multimodal interleaved Chain-of-Thought (CoT) paradigms built on visual modules can incorporate visual inputs, but they typically require additional tools and multi-step interactions. To address these issues, we propose ICoM, a coupled framework that integrates interleaved-CoT-driven adaptive visual focusing with layer-specific merging. ICoM employs a Q-Former to adaptively retrieve the most relevant regions of the original image via interleaved tokens, which are inserted before each textual reasoning step to enable visual focusing. We train Qwen2-VL-2B with a three-stage SFT+RL pipeline on the open-source MINT-CoT dataset. To enhance reasoning cost-effectively, we linearly merge only layers 19–27 of the post-trained Qwen2-VL-2B language component with the corresponding parameters of Qwen2-Math-1.5B-Instruct. Experiments show that ICoM-2B is competitive with state-of-the-art VLLMs (e.g., LLaVA-Reasoner-8B and Mulberry-7B) across six benchmarks. Notably, ICoM-2B outperforms GPT-4o-0513 by 2.13% on MathVista and 0.22% on MMStar. Code will be released upon acceptance.
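A minimal sketch of the layer-specific linear merging step described in the abstract, assuming the Hugging Face `transformers` checkpoint layouts where both the Qwen2-VL-2B language decoder and Qwen2-Math-1.5B-Instruct expose their transformer blocks under a `model.layers.{i}.` prefix with matching parameter names and shapes. The post-trained VLLM path and the interpolation weight `alpha` are placeholders, not values from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, Qwen2VLForConditionalGeneration

# Placeholder path for the post-trained (SFT+RL) Qwen2-VL-2B checkpoint.
vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/post-trained-qwen2-vl-2b", torch_dtype=torch.float32
)
math_lm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-Math-1.5B-Instruct", torch_dtype=torch.float32
)

alpha = 0.5            # assumed interpolation weight; not stated in the abstract
merge_layers = range(19, 28)  # layers 19-27 inclusive, per the abstract

vlm_sd = vlm.state_dict()
math_sd = math_lm.state_dict()

with torch.no_grad():
    for i in merge_layers:
        prefix = f"model.layers.{i}."
        for name, math_param in math_sd.items():
            if not name.startswith(prefix):
                continue
            # Assumed: the VLM's language decoder shares the Qwen2 layer
            # naming, so the same key indexes the matching parameter. The
            # guards below skip any key or shape mismatch silently.
            if name in vlm_sd and vlm_sd[name].shape == math_param.shape:
                # Linear interpolation: (1 - alpha) * vlm + alpha * math.
                vlm_sd[name].mul_(1 - alpha).add_(math_param, alpha=alpha)

vlm.load_state_dict(vlm_sd)
vlm.save_pretrained("icom-2b-merged")
```

Because only the upper decoder layers are interpolated, the vision encoder, projector, and lower language layers keep their post-trained weights unchanged; the exact module prefix may differ across `transformers` versions.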
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Generation, Multimodality and Language Grounding to Vision, Robotics and Beyond, Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning, Machine Learning for NLP
Languages Studied: Chinese, English
Submission Number: 8287