LantErn: Latent Visual Structured Reasoning

Published: 02 Mar 2026, Last Modified: 02 Mar 2026
ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: multimodal reasoning; latent reasoning; interleaved; vision-language model; transformers; decoding strategies
TL;DR: LantErn enables multimodal models to reason with compact latent visual representations, interleaving language and visual “thought” embeddings to perform visual reasoning directly in latent space.
Abstract: While language reasoning models excel in many tasks, visual reasoning is significantly harder. As a result, most large multimodal models (LMMs) default to verbalizing perceptual content into text, a serious limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LantErn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LantErn augments a vision-language transformer with the ability to generate and attend to continuous visual "thought" embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LantErn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. In several settings, latent visual reasoning allows smaller models to approach the performance of larger baselines, suggesting that internal latent representations provide a promising direction for more efficient multimodal reasoning.
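The decoding scheme the abstract describes, interleaving discrete language tokens with continuous visual "thought" embeddings that are fed back into the context without being projected to the vocabulary, can be illustrated with a toy sketch. Everything below is hypothetical: the transformer is replaced by mean-pooling, and the names (`toy_step`, `decode`, `latent_every`, `LATENT_TOK`) and the fixed latent-step schedule are illustrative stand-ins, not the paper's actual architecture or trigger mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8            # toy embedding dimension
VOCAB = 16       # toy vocabulary size
LATENT_TOK = -1  # sentinel marking a latent visual "thought" step in the trace

token_emb = rng.normal(size=(VOCAB, D))  # stand-in token embedding table

def toy_step(history):
    """Stand-in for one transformer forward pass: mean-pool the context.

    A real LMM would attend over `history`; mean-pooling keeps the sketch
    self-contained while preserving the key property that latent visual
    embeddings enter the context exactly like token embeddings.
    """
    return np.mean(history, axis=0)

def decode(prompt_ids, n_steps=6, latent_every=3):
    # Build the initial context from discrete prompt tokens.
    history = [token_emb[i] for i in prompt_ids]
    trace = []
    for t in range(n_steps):
        h = toy_step(np.stack(history))
        if (t + 1) % latent_every == 0:
            # Latent step: append the continuous hidden state directly,
            # skipping the vocabulary projection (the visual "thought").
            history.append(h)
            trace.append(LATENT_TOK)
        else:
            # Language step: pick a discrete token, then re-embed it.
            tok = int(np.argmax(token_emb @ h))
            history.append(token_emb[tok])
            trace.append(tok)
    return trace

trace = decode([1, 2, 3])
print(trace)  # mix of token ids and LATENT_TOK sentinels
```

The point of the sketch is the asymmetry between the two branches: language steps quantize the hidden state through the vocabulary, while latent steps keep it continuous, so later steps can attend to visual information that was never verbalized.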
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 54