Keywords: Latent Visual Reasoning, Multimodal LLM, VQA
Abstract: Vision-language models (VLMs) demonstrate strong capabilities in visual question answering (VQA), yet their reasoning remains confined to the text modality, limiting their alignment with inherently visual tasks. Latent visual reasoning has emerged as a promising alternative, offering more efficient and flexible inference. However, existing approaches focus primarily on modeling latent visual cues and overlook the importance of aligning latent reasoning with standard text decoding. In this work, we propose a supervised multi-step latent reasoning framework that scales implicit reasoning to VLMs. Our method applies step-level supervision to the latent hidden states through a frozen LLM decoder during training, bridging the modality gap and enriching semantic diversity. The LLM decoder is removed at inference to preserve native decoding efficiency. Experiments show that our latent visual reasoning approach matches the performance of explicit CoT finetuning on Qwen3-VL while significantly reducing token usage.
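To make the training-time setup in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the module names, the GRU-based latent stepper, the toy frozen decoder, and the per-step targets are hypothetical stand-ins for the actual VLM and LLM components; only the pattern of step-level supervision through a frozen decoder, removed at inference, mirrors the described method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed toy dimensions; the real model sizes are not specified here.
HIDDEN, VOCAB, STEPS = 512, 1000, 4

class LatentStepper(nn.Module):
    """Rolls fused visual-question features through a few latent reasoning steps."""
    def __init__(self):
        super().__init__()
        self.cell = nn.GRUCell(HIDDEN, HIDDEN)

    def forward(self, fused: torch.Tensor) -> list[torch.Tensor]:
        h, states = fused, []
        for _ in range(STEPS):
            h = self.cell(fused, h)   # one latent reasoning step
            states.append(h)
        return states

# Frozen decoder + LM head: used only during training to read out each latent step.
frozen_decoder = nn.Linear(HIDDEN, HIDDEN)   # stand-in for the frozen LLM decoder
lm_head = nn.Linear(HIDDEN, VOCAB)
for p in list(frozen_decoder.parameters()) + list(lm_head.parameters()):
    p.requires_grad_(False)

stepper = LatentStepper()
fused = torch.randn(8, HIDDEN)                      # stand-in fused VQA features
step_targets = torch.randint(0, VOCAB, (STEPS, 8))  # stand-in per-step supervision

# Step-level supervision: decode every latent state with the frozen decoder and
# apply cross-entropy; gradients flow only into the latent stepper.
loss = sum(
    F.cross_entropy(lm_head(frozen_decoder(state)), tgt)
    for state, tgt in zip(stepper(fused), step_targets)
) / STEPS
loss.backward()

# At inference the frozen decoder / LM head are dropped; only stepper(fused) is kept,
# so native decoding efficiency is preserved.
```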
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Language Modeling, Question Answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1188