Imagine, Verify, Execute: Agentic Exploration with Vision-Language Models

Anonymous
Anonymous Institution

The IVE (Imagine-Verify-Execute) agent autonomously explores Tangram pieces in the real world (top row), common objects (middle row), and objects in simulation (bottom row). Across these tasks, IVE converts visual input into semantic scene graphs, imagines novel configurations, verifies their physical feasibility, and executes actions to gather diverse, semantically grounded data for downstream learning.

Abstract

Exploration is a fundamental challenge in general-purpose robotic learning, particularly in open-ended environments where explicit human guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or hypothetical transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative in the environment. To bridge this gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. In humans, intrinsic motivation frequently emerges from the drive to discover novel scene configurations and to make sense of the environment, a process often enhanced by verbalizing goals or intentions through language. To enable this human-inspired approach, IVE abstracts RGB-D observations into semantic scene graphs, imagines novel future scenes, predicts their physical plausibility, and executes actions via action tools. We evaluate IVE in both simulated and real-world tabletop environments using a suite of exploration metrics and downstream tasks. The results show that our method produces more diverse and meaningful exploration than RL baselines with intrinsic curiosity. Additionally, the data collected by IVE enables downstream learning performance that closely matches that of policies trained on human-collected demonstrations.
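As a rough illustration of the scene-graph abstraction described in the abstract, the sketch below shows one possible way to represent a tabletop scene as objects plus pairwise spatial relations and to serialize it into text for a VLM. The class and field names are our own assumptions for exposition, not the authors' implementation.

from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Semantic abstraction of an RGB-D observation: objects as nodes,
    pairwise spatial relations as directed edges (hypothetical structure)."""
    objects: list = field(default_factory=list)     # e.g. ["red_block", "blue_cup"]
    relations: list = field(default_factory=list)   # (subject, relation, object) triples

    def to_text(self) -> str:
        """Serialize the graph so a VLM can reason over it in language."""
        return "; ".join(f"{s} {r} {o}" for s, r, o in self.relations)

# A toy tabletop scene
scene = SceneGraph(
    objects=["red_block", "blue_cup", "table"],
    relations=[("red_block", "on", "table"), ("blue_cup", "left_of", "red_block")],
)
print(scene.to_text())  # "red_block on table; blue_cup left_of red_block"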

Project video

Overview

Overview of IVE (Imagine, Verify, Execute). The Scene Describer constructs a scene graph from observations, the Explorer imagines novel configurations guided by memory retrieval, and the Verifier predicts the physical plausibility of the proposed transitions. Verified plans are executed using action tools. Exploration is structured around semantic reasoning, verification, and physically grounded interaction.
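The sketch below outlines one Imagine-Verify-Execute iteration as suggested by the overview. All interfaces here (describe_scene, imagine, verify, plan_actions, and the memory and tools objects) are hypothetical placeholders standing in for the Scene Describer, Explorer, Verifier, and action tools, not the authors' API.

def ive_step(observation, memory, vlm, tools):
    """One Imagine-Verify-Execute iteration (hypothetical interfaces)."""
    # 1. Scene Describer: abstract the RGB-D observation into a scene graph.
    scene_graph = vlm.describe_scene(observation)

    # 2. Explorer: imagine a novel configuration, conditioned on retrieved memory.
    similar_past = memory.retrieve(scene_graph, k=5)
    imagined_graph = vlm.imagine(scene_graph, context=similar_past)

    # 3. Verifier: predict whether the imagined transition is physically plausible.
    if not vlm.verify(scene_graph, imagined_graph):
        return None  # discard implausible proposals; imagine again next step

    # 4. Execute: translate the verified transition into action-tool calls (e.g. pick-and-place).
    for action in vlm.plan_actions(scene_graph, imagined_graph, tools):
        tools.execute(action)

    memory.add(scene_graph, imagined_graph)
    return imagined_graph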

Ablation baselines

Exploring with Embodied Agents: This figure compares the exploration capabilities of our method, IVE, powered by different Vision-Language Models (VLMs), against a human expert. The plots show performance on four key metrics as a function of interaction steps: (Left) the growth in the number of unique scene graphs discovered, (Middle Left) the entropy of visited states (a measure of diversity), (Middle Right) empowerment (the agent's ability to influence future states), and (Right) information gain (the amount of new information acquired). Notably, IVE, regardless of the VLM used, surpasses the human expert in generating unique scene graphs, achieving higher state diversity, and gaining more information.
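Two of the reported metrics, the number of unique scene graphs and the entropy of visited states, are straightforward to compute from an exploration trace. The minimal sketch below assumes each visited scene graph has been serialized to a canonical string; it is illustrative only, not the evaluation code used in the paper.

import math
from collections import Counter

def unique_scene_graphs(history):
    """Number of distinct scene graphs visited so far."""
    return len(set(history))

def state_entropy(history):
    """Shannon entropy of the empirical visitation distribution over scene graphs."""
    counts = Counter(history)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# history: one canonical scene-graph string per interaction step
history = ["A on B", "A on B", "B on A", "A left_of B"]
print(unique_scene_graphs(history))        # 3
print(round(state_entropy(history), 3))    # entropy in nats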

Prompts


BibTeX

@article{anonymous2025ive,
  author    = {Anonymous},
  title     = {Imagine, Verify, Execute: Agentic Exploration with Vision-Language Models},
  journal   = {Under review},
  year      = {2025}
}