v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Multimodal Reasoning, Visual Grounding
TL;DR: We present v1, a novel multimodal reasoning framework that dynamically revisits and grounds visual information during inference, outperforming static image processing approaches on mathematical reasoning benchmarks.
Abstract: When thinking with images, humans rarely rely on a single glance: they revisit visual information repeatedly during reasoning. However, existing models typically process images only once and thereafter generate reasoning entirely in text, lacking mechanisms to re-access or ground inference in visual representations. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. In response, we introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream, ensuring that evolving hypotheses remain grounded in perceptual evidence. Crucially, our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model’s reasoning. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Across various multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines, establishing dynamic visual access based on point-and-copy as a practical mechanism for grounded reasoning. We will release the model checkpoint and data.
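The abstract describes a point-and-copy mechanism in which the model scores image patch embeddings against its current reasoning state and copies the selected patch embedding back into the generation stream. The paper's actual implementation is not given on this page; the following is only a minimal illustrative sketch, assuming patch embeddings share the model's representation space, with all names, shapes, and the argmax selection rule being hypothetical choices rather than the authors' method.

```python
# Hypothetical sketch of a single "point-and-copy" step: score patch
# embeddings against the current reasoning-token representation, pick
# the best match, and return its embedding to be copied into the stream.
import torch
import torch.nn.functional as F

def point_and_copy(hidden_state: torch.Tensor, patch_embeddings: torch.Tensor):
    """
    hidden_state:     (d,)   current reasoning-token representation
    patch_embeddings: (n, d) visual patch embeddings in the same space
    Returns the selected patch embedding and its index.
    """
    # Patch embeddings act directly as keys: dot-product similarity
    # in the shared embedding space.
    scores = patch_embeddings @ hidden_state          # (n,)
    probs = F.softmax(scores, dim=-1)                 # pointing distribution
    idx = int(torch.argmax(probs))                    # pointed patch index
    copied = patch_embeddings[idx]                    # embedding copied back
    return copied, idx

# Example usage with 196 patches of dimension 1024 (illustrative values).
patches = torch.randn(196, 1024)
h = torch.randn(1024)
copied_embedding, patch_idx = point_and_copy(h, patches)
```

In a full model, the copied embedding would be appended to the decoder's input sequence so that subsequent reasoning tokens attend to it directly, which is the sense in which the hypotheses stay grounded in perceptual evidence.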
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5351