Semantic-Enriched Latent Visual Reasoning

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 regular. License: CC BY 4.0
TL;DR: We propose the SLVR algorithm, introduce the SLV-Set training dataset, and present the SV-QA benchmark for evaluating semantic-enriched latent visual reasoning.
Abstract: Multimodal latent-space reasoning aims to replace explicit “thinking with images” by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines. Our code and datasets are available at https://github.com/tinnel123666888/slvr.
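Illustrative sketch (not the authors' implementation): the abstract does not specify how M-GRPO combines queries, so the snippet below only shows the standard group-relative advantage used in GRPO-style training, applied separately to each query that shares a region. The function names, the per-query grouping, and the `eps` value are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


def multi_query_advantages(rewards_by_query, eps=1e-8):
    """Hypothetical multi-query wrapper: one group per query on the same region.

    Assumption: each query's sampled responses form their own group. The
    actual M-GRPO objective may aggregate across queries differently.
    """
    return {
        query: group_relative_advantages(rewards, eps)
        for query, rewards in rewards_by_query.items()
    }


# Two queries about the same region, four sampled responses each.
example = multi_query_advantages({
    "what color is their shirt?": [1.0, 0.0, 1.0, 0.0],
    "what are they doing?": [1.0, 1.0, 1.0, 1.0],
})
```

In this sketch, a query whose sampled responses all receive the same reward yields zero advantages, so it contributes no learning signal for that step.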
Lay Summary: Modern AI systems that combine vision and language can answer questions about images, but they often struggle when the same part of a picture is asked about in different ways. For example, looking at a person in a photo, the AI might correctly answer "what color is their shirt?" but then fail at "what are they doing?"—even though both questions are about the same person. This is because the AI's internal understanding of the image captures only surface-level visual features, not the rich semantic details that humans naturally notice, such as objects, colors, actions, and spatial relationships. In this work, we propose SLVR, a training approach that teaches AI to build a richer internal representation of each visual region by aligning it with detailed semantic descriptions, and then encourages the AI to use this representation consistently when answering different questions about the same region. To support this, we automatically build a large dataset of about 400,000 region descriptions and 800,000 paired questions, and release a benchmark that tests whether the AI can consistently handle multiple questions about the same visual region. Across diverse evaluations, SLVR produces more accurate and more consistent answers, moving the field toward AI systems that truly understand what they see rather than just describing surface appearances.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/tinnel123666888/slvr
Primary Area: Applications->Computer Vision
Keywords: Multimodal Latent Reasoning, Vision–Language Models, Region-Level Semantics
Originally Submitted PDF: pdf
Submission Number: 1556