MirrorCoT: Lightweight Multimodal Interleaved Chain-of-Thought

ACL ARR 2026 January Submission 955 Authors

26 Dec 2025 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: MLLM, VQA, Chain-of-thought
Abstract: Recent advances in multimodal interleaved Chain-of-Thought (CoT) have shown great potential for boosting the reasoning performance of Multimodal Large Language Models (MLLMs). However, existing work rarely examines how and when visual information should be injected during multimodal reasoning. In this work, we systematically study **MirrorCoT**, a lightweight multimodal interleaved CoT with a **query-triggered visual injection** mechanism. We conduct comprehensive comparisons against state-of-the-art baselines (e.g., VoCoT, VQD) on LLaVA-1.5 and InternVL-2 across seven benchmarks, including MMStar and HallusionBench. Our key finding is that a simple structural modification, which forces the model to explicitly emit a sub-question that triggers a visual information injection, consistently outperforms dense visual token insertion. This modification not only boosts task accuracy (e.g., delivering a 5.1% improvement on LLaVA-1.5) but also cuts the number of visual tokens required by 91.1%. Further analysis reveals that the gains stem from **Dynamic Inquiry** (deciding when to look) and **Targeted Feature Extraction** (deciding what to retrieve), which also mitigates hallucination in long-context generation.
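The abstract's core mechanism can be illustrated with a toy control loop. The sketch below is purely illustrative and is not the paper's implementation: the `<ask>` trigger token, `model_step`, and `visual_lookup` are all hypothetical names invented here to show how a sub-question emitted mid-reasoning could gate a small, targeted visual injection instead of a dense token grid.

```python
def run_interleaved_cot(model_step, visual_lookup, max_steps=10):
    """Toy interleaved-CoT loop (illustrative only).

    model_step(context)  -> next reasoning token, or None when done (hypothetical)
    visual_lookup(query) -> a few visual tokens answering the sub-question (hypothetical)
    """
    context = []
    for _ in range(max_steps):
        token = model_step(context)
        if token is None:  # model signals end of reasoning
            break
        context.append(token)
        if token.startswith("<ask>"):  # Dynamic Inquiry: the model decides *when* to look
            sub_question = token[len("<ask>"):]
            # Targeted Feature Extraction: retrieve only what the sub-question needs,
            # injecting a handful of tokens rather than a dense visual-token insertion
            context.extend(visual_lookup(sub_question))
    return context
```

A minimal usage example with stubbed components, to show the trace shape only:

```python
steps = iter(["Step 1: identify the object.",
              "<ask>what color is the car?",
              "So the answer is red.",
              None])
trace = run_interleaved_cot(lambda ctx: next(steps),
                            lambda q: ["[vis: red car]"])
# The retrieved visual token is interleaved right after the sub-question.
```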
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Language Modeling, NLP Applications, Question Answering
Languages Studied: English
Submission Number: 955