Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models

ICLR 2026 Conference Submission 4885 Authors

13 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Efficient Reasoning, Multimodal Chain of Continuous Thought, Latent Reasoning
TL;DR: Exploring continuous thought for latent-space reasoning in VLMs
Abstract: Many reasoning techniques for large multimodal models (LMMs) adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which expresses reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model’s last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding accuracy gains of up to 8.23% over strong baselines and improving BLEU scores by up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference.
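The abstract describes MCOUT-Base as feeding the model's last hidden state back as a "continuous thought" for further reasoning steps, rather than decoding word tokens at each step. The sketch below is a minimal illustration of that loop, not the authors' implementation: it assumes a Hugging Face-style causal VLM whose forward pass accepts `inputs_embeds`, and the function name, `num_latent_steps`, and the identity mapping from hidden state to embedding are all hypothetical choices for exposition.

```python
import torch

def mcout_base_rollout(model, inputs_embeds, attention_mask, num_latent_steps=4):
    """Sketch of MCOUT-Base-style latent reasoning (assumed interface).

    Instead of sampling a word token at each reasoning step, the final-layer
    hidden state at the last position is appended back to the input sequence
    as a continuous thought, and the model is run again on the extended
    sequence. After num_latent_steps iterations, normal answer decoding
    would proceed from the extended context.
    """
    for _ in range(num_latent_steps):
        out = model(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        # Continuous thought: last-layer hidden state at the final position, (B, 1, D).
        thought = out.hidden_states[-1][:, -1:, :]
        # Iterative refinement: feed the thought back as the next input embedding.
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)
        attention_mask = torch.cat(
            [attention_mask, attention_mask.new_ones(attention_mask.size(0), 1)],
            dim=1,
        )
    return inputs_embeds, attention_mask
```

Note that this assumes the hidden-state and embedding spaces coincide; if their dimensions or scales differ, a learned projection between them would be needed. The abstract does not specify how MCOUT-Multi's multimodal latent attention aligns the thought with visual and textual features, so that component is omitted here.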
Primary Area: causal reasoning
Submission Number: 4885