Keywords: Exploration, Agentic System, Vision-Language Model
Abstract: Exploration is key for general-purpose robotic learning, particularly in open-ended environments where explicit guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration often emerges from the drive to discover novel scene configurations and to understand the environment.
Inspired by this, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE produces more diverse and meaningful exploration than RL baselines. Moreover, the collected data facilitates downstream task learning, yielding policies whose performance closely matches that of policies trained on human-collected demonstrations.
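To make the three-stage loop described above concrete, the following is a minimal Python sketch of how an imagine-verify-execute cycle could be wired together. It is an illustration only, not the authors' implementation: every name here (SceneGraph, vlm_abstract, vlm_imagine, vlm_verify, vlm_plan, DummyTabletopEnv, the skill strings) is a hypothetical placeholder, and the vlm_* functions stand in for actual VLM queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneGraph:
    """Semantic abstraction of an RGB-D observation: objects plus relations."""
    objects: tuple[str, ...] = ()
    relations: tuple[tuple[str, str, str], ...] = ()  # (subject, relation, object)


def vlm_abstract(rgbd_obs) -> SceneGraph:
    # Placeholder: a real system would query a VLM to extract objects and
    # spatial relations from the RGB-D observation.
    return SceneGraph(objects=("red_cube", "blue_bowl"),
                      relations=(("red_cube", "left_of", "blue_bowl"),))


def vlm_imagine(scene: SceneGraph, history: list) -> SceneGraph:
    # Placeholder: propose a novel scene configuration not yet visited,
    # e.g. a new relation between existing objects.
    return SceneGraph(scene.objects, (("red_cube", "inside", "blue_bowl"),))


def vlm_verify(current: SceneGraph, imagined: SceneGraph) -> bool:
    # Placeholder: a real verifier would ask the VLM whether the imagined
    # transition is physically plausible from the current scene.
    return imagined.relations != current.relations


def vlm_plan(current: SceneGraph, target: SceneGraph) -> list[str]:
    # Placeholder: translate the desired scene change into a sequence of
    # executable skill primitives via action tools.
    return ["pick(red_cube)", "place_in(blue_bowl)"]


class DummyTabletopEnv:
    """Stand-in environment so the sketch runs end to end."""
    def reset(self):
        return "rgbd_frame_0"

    def step(self, skill: str):
        return f"rgbd_after_{skill}"


def ive_explore(env, num_steps: int) -> list:
    """One exploration episode: imagine a novel scene, verify it, execute it."""
    dataset, history = [], []
    obs = env.reset()
    for _ in range(num_steps):
        scene = vlm_abstract(obs)                 # abstract observation
        history.append(scene)
        target = vlm_imagine(scene, history)      # Imagine
        if not vlm_verify(scene, target):         # Verify
            continue  # discard implausible imaginations before acting
        for skill in vlm_plan(scene, target):     # Execute
            obs = env.step(skill)
            dataset.append((scene, skill, obs))
    return dataset


if __name__ == "__main__":
    data = ive_explore(DummyTabletopEnv(), num_steps=3)
    print(f"collected {len(data)} transitions")
```

The verify step acts as a filter between imagination and execution: under this reading of the abstract, implausible imagined transitions are rejected before any robot action is taken, so only grounded, executable proposals contribute to the collected dataset.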
Supplementary Material: zip
Spotlight: mp4
Submission Number: 907