The IVE (Imagine-Verify-Execute) agent autonomously explores Tangram pieces in the real world (top row), common objects (middle row), and objects in simulation (bottom row). Across these tasks, IVE converts visual input into semantic scene graphs, imagines novel configurations, verifies their physical feasibility, and executes actions to gather diverse, semantically grounded data for downstream learning.
Exploration is a fundamental challenge of general-purpose robotic learning, particularly in open-ended environments where explicit human guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or hypothetical transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative in the environment. To bridge this gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. In humans, intrinsic motivation frequently emerges from the drive to discover novel scene configurations and to make sense of the environment, a process often enhanced by verbalizing goals or intentions through language. Accordingly, IVE abstracts RGB-D observations into semantic scene graphs, imagines novel future scenes, predicts their physical plausibility, and executes actions via action tools. We evaluate IVE in both simulated and real-world tabletop environments using a suite of exploration metrics and downstream tasks. The results show that our method produces more diverse and meaningful exploration than RL baselines with intrinsic curiosity. Moreover, the data collected by IVE enables downstream learning performance that closely matches that of policies trained on human-collected demonstrations.
Overview of IVE (Imagine, Verify, Execute). The Scene Describer constructs a scene graph from observations, the Explorer imagines novel configurations guided by memory retrieval, and the Verifier predicts the physical plausibility of proposed transitions. Verified plans are executed using action tools. Exploration is structured around semantic reasoning, verification, and physically grounded interaction.
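To make the overview concrete, the following is a minimal Python sketch of the Imagine-Verify-Execute cycle. All class, method, and attribute names (`describer.describe`, `explorer.imagine`, `verifier.verify`, `verdict.valid`, and so on) are hypothetical stand-ins for the modules described above, not the released implementation.

```python
# Minimal sketch of the Imagine-Verify-Execute loop.
# All names below are illustrative stand-ins for IVE's modules.

def ive_exploration_loop(env, describer, explorer, verifier, memory, steps=100):
    obs = env.observe()  # RGB-D observations from two camera views
    for _ in range(steps):
        # Imagine: abstract observations into a semantic scene graph,
        # then propose a novel configuration, conditioned on the most
        # similar scene graphs retrieved from memory.
        scene_graph = describer.describe(obs)
        similar = memory.retrieve(scene_graph)
        plan, desired_graph = explorer.imagine(obs, scene_graph, similar)

        # Verify: predict physical plausibility before acting.
        verdict = verifier.verify(obs, scene_graph, plan, desired_graph)
        if not verdict.valid:
            continue  # re-imagine rather than execute an infeasible plan

        # Execute: run the verified plan with low-level action tools.
        obs = env.execute(plan)
        memory.store(scene_graph, plan)
```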
Exploring with Embodied Agents: This figure compares the exploration capabilities of our method, IVE, powered by different Vision-Language Models (VLMs), against a human expert. The plots show performance across four key metrics as a function of interaction steps: (Left) the growth in the number of unique scene graphs discovered, (Middle Left) the entropy of visited states (a measure of diversity), (Middle Right) empowerment (the agent's ability to influence future states), and (Right) information gain (the amount of new information acquired). Notably, IVE, regardless of the VLM used, surpasses the human expert in generating unique scene graphs, achieving higher state diversity, and gaining more information.
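As a concrete reading of two of the plotted metrics, the sketch below computes the number of unique scene graphs and the entropy of visited states from an exploration log. Representing each scene graph as a frozenset of relation triples and using the empirical Shannon entropy are illustrative assumptions, not the paper's exact definitions.

```python
import math
from collections import Counter

def exploration_metrics(visited_scene_graphs):
    """Unique-scene-graph count and state-visitation entropy.

    Each visited scene graph is assumed to be an iterable of
    (subject, relation, object) triples; hashing it as a frozenset
    is an illustrative choice."""
    keys = [frozenset(graph) for graph in visited_scene_graphs]
    if not keys:
        return 0, 0.0
    counts = Counter(keys)
    n = len(keys)
    unique = len(counts)
    # Shannon entropy (in nats) of the empirical visitation distribution.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return unique, entropy
```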
## Your Task
You are an expert image analyzer tasked with identifying the **exact** placement and spatial relationships of specific objects. Your job is to generate a scene graph describing these spatial relations **solely** based on the objects’ visible positions in the image.
As an image analyzer, follow Steps 1–3 below.
---
## Step 1: Fill in the Answer in the QnA Section
---
## Step 2: Iterative Scene Graph Construction
1. Begin with one object.
2. Add one new object at a time to your partial scene graph.
3. For each newly added object:
- Determine its spatial relation(s) to the objects already in the scene graph.
- **Use only** the Allowed Relations in the scene graph.
- Do not assign more than one relation to the same object pair; `(new_object, existing_object)` and `(existing_object, new_object)` count as the same pair.
- You may introduce multiple relations at once if the new object relates to multiple existing objects.
---
## Step 3: Final Scene Graph Output
1. **Once all objects** have been introduced and verified, compile a **complete scene graph**:
- **List all nodes** (the objects in the final scene).
- **List all verified relations** between pairs of objects, using the Allowed Relations in the scene graph.
2. **Use only** objects from the "Global Object Names."
3. Even if there are missing nodes or edges in the final scene graph (because at least one object is missing), you must still provide a complete **scratch pad** and **scene graph** containing the relations that do exist.
---
## Scene Graph Representation
- Nodes: Objects present in the scene.
- Relations: Spatial relationships between object pairs.
- Allowed Relations in the scene graph:
- **Stacked On**: Object A is physically resting on Object B. This requires clear direct contact—Object A is visibly supported by Object B from below.
- **Near**: Object A is positioned close to Object B without being stacked. Use this only when the objects are almost touching.
---
## Global Object Names
``
---
## Output Format
Please structure your final output exactly as shown below (without the dashed lines). **Use the precise section titles**:
```
-------------
[Step 1: Fill in the Answer in the QnA Section]
[Step 2: Iterative Scene Graph Construction]
Iteration 1:
- Added obj_a.
- Explanation of how you confirmed its presence in the image.
Iteration 2:
- Added obj_b.
- Relation(s) added for obj_b (include any additional relations or notes)
- Explanation of how you verified this relation.
... (continue until all objects are added and checked)
[Step 3: Final Scene Graph Output]
Nodes: obj_a, obj_b, ...
Relations: <relation>, <relation>, ...
-------------
```
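A minimal sketch of how a scene graph obeying the rules above could be represented in code, assuming objects are plain strings; the `SceneGraph` class is illustrative and not part of the released system. Keying relations by the unordered object pair enforces Step 2's rule that `(new_object, existing_object)` and `(existing_object, new_object)` receive at most one relation.

```python
# Illustrative scene-graph container for the Scene Describer's output.
# Not part of the released system.

ALLOWED_RELATIONS = {"Stacked On", "Near"}

class SceneGraph:
    def __init__(self):
        self.nodes = set()
        # Unordered pair frozenset({a, b}) -> directed (subject, relation, object),
        # so each object pair carries at most one relation.
        self.relations = {}

    def add_relation(self, subject, relation, obj):
        if relation not in ALLOWED_RELATIONS:
            raise ValueError(f"relation {relation!r} is not allowed")
        pair = frozenset((subject, obj))
        if pair in self.relations:
            raise ValueError(f"pair {subject}/{obj} already has a relation")
        self.nodes.update((subject, obj))
        self.relations[pair] = (subject, relation, obj)
```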
## Your Task
You are an expert spatial planner. Given the Current Image, your job is to generate a sequence of actions that discovers a new scene configuration—one that has not been seen before.
- In addition to the action sequence, you must provide the predicted future scene graph (desired scene graph) that results from these actions.
- You have two images taken from different camera viewpoints.
- You should provide at most `` actions.
---
## Scene Graph Representation
- Nodes: Objects present in the scene.
- Relations: Spatial relationships between object pairs.
- Allowed Relations in the scene graph:
  - **Stacked On**: Object A is physically resting on Object B. This requires clear direct contact—Object A is visibly supported by Object B from below.
  - **Near**: Object A is positioned close to Object B without being stacked. Use this only when the objects are almost touching.
---
## Global Object Names
` `
---
## Current Scene Graph
` `
---
## Scene Graph History
Shows previously visited scene graphs most similar to your current scene.
---
## Action History
` `
---
## Output Format
Your output format should look exactly like the content between the `-----`. **Do not** number the actions. It's important to wrap the action sequence between ` ` and ` `. Also, write down the predicted future scene graph (desired scene graph - the final arrangement after all actions) between ` ` and ` `.
-----
Explain your reasoning:
- Why this is a novel scene
- Why the action sequence makes sense
- If there were oddities or contradictions in the histories, how did you account for possible collisions, suction errors, or clutter?
Predict (Desired) Future Scene Graph:
Nodes: obj_a, obj_b, ...
Relations: <relation>, <relation>, ...
Next Action Sequence:
-----
### Important Considerations
1. Order Matters: Plan your actions so that preconditions are satisfied before you move an object.
2. Scene Boundaries: If an object is near the scene boundary, avoid pushing it further toward the edge or placing new objects in a risky position.
3. Manipulation (Suction) Constraints:
   - The suction can only reliably pick the topmost exposed surface.
   - In cluttered areas, an attempt to move one object may cause unintended collisions or shifts in neighboring objects.
   - Stacking another object on top of an unstable object can lead to the object toppling over.
4. Note: The list of allowed relations in Action Types and the relations used in Scene Graph Representation ([Stacked On, Near]) may differ. The Desired Scene Graph should use only the relations [Stacked On, Near], the same as the other Scene Graphs. Please keep this in mind when planning your actions.
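The Scene Graph History section of this prompt is filled with the previously visited scene graphs most similar to the current one. One simple way to rank that similarity, sketched below, is Jaccard overlap of relation triples; the choice of metric is an assumption about the memory-retrieval step, not a documented detail.

```python
def jaccard_similarity(graph_a, graph_b):
    # Overlap of two scene graphs, each a set of (subject, relation, object)
    # triples. Returns 1.0 for two empty graphs by convention.
    a, b = set(graph_a), set(graph_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retrieve_most_similar(current_graph, history, k=3):
    # Pick the k previously visited scene graphs closest to the current
    # one; these populate the Scene Graph History section of the prompt.
    return sorted(history,
                  key=lambda g: jaccard_similarity(current_graph, g),
                  reverse=True)[:k]
```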
## Your Task
You are a spatial reasoning expert responsible for **verifying action plans** in physically dynamic environments.
You ensure that a proposed sequence of actions logically leads from the current state to the desired scene graph, without triggering unintended outcomes.
You may also provide **targeted suggestions** or, in rare but necessary cases, recommend a **temporary shift to a decluttering strategy**.
---
## Goals
Given the current image (from two camera views), transition history, desired scene graph, and a proposed action sequence:
1. **Simulate** the effect of the action sequence from the current scene
2. **Predict** the resulting scene graph
3. **Compare** the predicted graph with the desired one
4. **Evaluate physical feasibility and execution stability**
5. **Provide a judgment**:
- Valid and feasible
- Invalid (with reason)
- Valid but risky (suggest a targeted fix)
- Too unstable to proceed (recommend declutter mode)
---
## Transition History
A sequence of alternating scene graphs and actions showing the environment's evolution.
``
---
## Output Format
```
-----
Step-by-step analysis:
- Simulate and predict the resulting scene graph.
Scene Stability Check:
- Are any objects in clearly unstable or unreachable positions?
- Do previous transitions indicate failures or ambiguous changes?
- Are cluttered zones, deep stacks, or occlusions affecting safety or reliability?
Decision:
- Is the action sequence logically valid and does it produce the desired scene graph?
→ YES or NO
If NO:
- Explain which actions fail and why.
- Point out mismatches or invalid transitions.
If YES but issues are detected:
- Identify objects or areas causing risk (e.g., unstable stacks, blocked objects).
- Suggest fine-grained intervention (e.g., "move obj_A before continuing").
If the environment is severely cluttered and unsafe:
- Recommend a temporary shift to a decluttering mode
YES or NO
[If NO: Brief but clear explanation of what failed or was mismatched]
[If YES but risky: Warning message with suggestion, e.g., "Unstable stack: move obj_b before continuing"]
[If YES but too unstable: "Scene too cluttered. Recommend temporary declutter mode."]
[If YES and no issues: Leave this part empty]
-----
```
---
## Scene Stability Considerations
Clutter or instability **does not always require full decluttering**. Consider recommending targeted fixes first.
#### Examples of Minor Intervention:
- `"obj_b is stacked on obj_a, which is already supporting obj_c. Recommend moving obj_b first to prevent instability."`
- `"obj_d is partially occluded and may be hard to suction. Recommend shifting nearby obj_e first."`
#### Examples of Decluttering (rare):
- `"Multiple overlapping clusters and deep stacks suggest high instability. Recommend decluttering of current layout before further scene exploration."`
@article{anonymous2025ive,
author = {Anonymous},
title = {Imagine, Verify, Execute: Agentic Exploration with Vision-Language Models},
journal = {Under review},
year = {2025}
}