Visual Reasoning via Perceptual Extension and In-Context Learning

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: visual reasoning, vision-language model
Abstract: Reasoning with visual information has recently gained significant attention in the field of large vision-language models (LVLMs). Existing R1-like reasoning LVLMs are usually finetuned from a base LVLM on a large-scale vision-language dataset, incorporating reinforcement learning (RL) with rewards from verifiable answers. However, such reasoning LVLMs usually require high-quality multimodal long-chain datasets for supervised finetuning in the cold-start stage, and time-consuming sampling of multiple responses in the RL stage. We therefore seek an efficient approach to visual reasoning. To do so, we first investigate the interaction between visual and textual tokens in LVLMs, and find that although the post-trained reasoning LVLM improves cross-modal interaction, it does so only at deep layers and for long responses; the improvement is negligible for short responses. Based on these observations and insights, we propose to separate the perception and reasoning processes: the LVLM avoids generating long responses, and thus maintains its cross-modal interaction ability, while the reasoning is delegated to an LLM, which is not required to integrate cross-modal information. To this end, we leverage existing reasoning large language models (LLMs) with a VLM extension that synthesizes visual and textual information in advance, after which the LLM performs the reasoning, without any finetuning. Furthermore, to make full use of the training samples, we use a matching mechanism to find relevant reasoning processes and incorporate them via in-context learning. We evaluate our method on common visual reasoning benchmarks. The results show that, without extra training samples, our method achieves performance comparable to existing post-trained reasoning LVLMs, and outperforms them with in-context learning.
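The abstract describes a two-stage pipeline (perception by a VLM extension, reasoning by a frozen LLM) plus a matching mechanism for in-context examples. The sketch below is a minimal, illustrative reading of that pipeline, not the authors' implementation: `vlm_describe` and `llm_reason` are hypothetical placeholders for the VLM extension and the reasoning LLM, and sentence-transformers is used here only as one possible choice of matching mechanism.

```python
# Hedged sketch of the perception/reasoning separation with retrieval-based ICL.
# `vlm_describe` and `llm_reason` are hypothetical placeholders; the matching
# step uses sentence-transformers as an assumed, interchangeable text encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # text encoder for matching

def vlm_describe(image, question):
    """Placeholder: the VLM extension turns the image + question into a
    detailed textual description, so the later reasoning step needs no
    cross-modal fusion."""
    raise NotImplementedError

def llm_reason(prompt):
    """Placeholder: a frozen reasoning LLM produces the final answer."""
    raise NotImplementedError

def retrieve_examples(query, pool, k=2):
    """Match the current question against stored (question, reasoning) pairs
    and return the k most similar reasoning traces for in-context learning."""
    q_emb = encoder.encode(query, convert_to_tensor=True)
    p_emb = encoder.encode([q for q, _ in pool], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0]
    top = scores.topk(min(k, len(pool))).indices.tolist()
    return [pool[i] for i in top]

def answer(image, question, pool):
    description = vlm_describe(image, question)   # perception stage
    examples = retrieve_examples(question, pool)  # matching mechanism
    demos = "\n\n".join(f"Q: {q}\nReasoning: {r}" for q, r in examples)
    prompt = f"{demos}\n\nImage description: {description}\nQ: {question}\nA:"
    return llm_reason(prompt)                     # reasoning stage, no finetuning
```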
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8159