<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <meta name="description" content="IVE: Imagine, Verify, Execute: Agentic Exploration with Vision-Language Models">
    <meta name="keywords" content="IVE, Vision-Language Models, Agentic Exploration">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Imagine, Verify, Execute: Agentic Exploration with Vision-Language Models</title>

    <!-- Global site tag (gtag.js) - Google Analytics -->
    <!-- <script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script> -->
    <script>
        window.dataLayer = window.dataLayer || [];

        function gtag() {
            dataLayer.push(arguments);
        }

        gtag('js', new Date());

        gtag('config', 'G-PYVRSFMDRL');
    </script>

    <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">

    <link rel="stylesheet" href="./assets/css/bulma.min.css">
    <link rel="stylesheet" href="./assets/css/bulma-carousel.min.css">
    <link rel="stylesheet" href="./assets/css/bulma-slider.min.css">
    <link rel="stylesheet" href="./assets/css/fontawesome.all.min.css">
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
    <link rel="stylesheet" href="./assets/css/index.css">
    <link rel="icon" href="./assets/images/favicon.ico">
    <link rel="stylesheet" href="assets/css/prompts.css">

    <script src="assets/js/prompts.js"></script>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
    <script defer src="./assets/js/fontawesome.all.min.js"></script>
    <script src="./assets/js/bulma-carousel.min.js"></script>
    <script src="./assets/js/bulma-slider.min.js"></script>
    <script src="./assets/js/index.js"></script>
</head>

<body>



    <section class="hero">
        <div class="hero-body">
            <div class="container is-max-desktop">
                <div class="columns is-centered">
                    <div class="column has-text-centered">
                        <h1 class="title is-1 publication-title">Imagine, Verify, Execute: Agentic Exploration with
                            Vision-Language Models</h1>
                        <div class="is-size-5 publication-authors">
                            <span class="author-block">
                                Anonymous</span>


                        </div>

                        <div class="is-size-5 publication-authors">
                            <span class="author-block"><sup>1</sup>Anonymous Institution</span>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </section>

    <!-- Videos -->
    <section class="hero is-light is-small">
        <div class="hero-body">
            <!-- Tangram videos -->
            <div class="container">
                <div id="results-carousel" class="carousel results-carousel">
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/tangram_explr1.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/tangram_explr2.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/tangram_explr3.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/tangram_explr4.mp4" type="video/mp4">
                        </video>
                    </div>

                </div>
            </div>
            <!-- Tangram videos end -->

            <!-- Object videos -->
            <div class="container">
                <div id="results-carousel" class="carousel results-carousel">
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/object_ive_explr1.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/object_ive_explr2.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/object_ive_explr3.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/object_ive_explr4.mp4" type="video/mp4">
                        </video>
                    </div>
                </div>
            </div>
            <!-- Object videos end -->

            <!-- Simulation videos -->
            <div class="container">
                <div id="results-carousel" class="carousel results-carousel">
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/vimabench1.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/vimabench3.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/vimabench3.mp4" type="video/mp4">
                        </video>
                    </div>
                    <div class="item">
                        <video poster="" autoplay controls muted loop playsinline height="100%">
                            <source src="assets/videos/vimabench1.mp4" type="video/mp4">
                        </video>
                    </div>
                </div>
                <div class="container">
                    <div class="columns">
                        <div id="top-video-description" class="column has-text-justified">
                            <p>The IVE (Imagine, Verify, Execute) agent autonomously explores Tangram pieces in the real world (top row), common objects (middle row), and objects in simulation (bottom row). Across these tasks, IVE converts visual input into semantic scene graphs, imagines novel configurations, verifies their physical feasibility, and executes actions to gather diverse, semantically grounded data for downstream learning.</p>
                        </div>
                    </div>
                </div>

            </div>
            <!-- Simulation videos end -->

        </div>
    </section>
    <!-- End Videos -->

    <!-- Abstract -->
    <section class="section">
        <div class="container is-max-desktop">
            <h3 class="title is-3 has-text-centered">Abstract</h3>
            <div class="content has-text-justified">
                <p>
                    Exploration is a fundamental challenge of general-purpose robotic learning, particularly in
                    open-ended environments where explicit human guidance or task-specific feedback is limited.
                    Vision-language models (VLMs), which can reason about object semantics, spatial relations,
                    and potential outcomes, offer a promising foundation for guiding exploratory behavior by
                    generating high-level goals or hypothetical transitions. However, their outputs lack
                    grounding, making it difficult to determine whether imagined transitions are physically
                    feasible or informative in the environment. To bridge this gap between imagination and
                    execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework
                    inspired by human curiosity. In humans, intrinsic motivation frequently emerges from the
                    drive to discover novel scene configurations and to make sense of the environment. This
                    process is often enhanced by verbalizing goals or intentions through language. To enable
                    this human-inspired approach, IVE abstracts RGB-D observations into semantic scene graphs,
                    imagines novel future scenes, predicts their physical plausibility, and executes actions via
                    action tools. We evaluate IVE in both simulated and real-world tabletop environments using a
                    suite of exploration metrics and downstream tasks. The results show that our method produces
                    more diverse and meaningful exploration than RL baselines with intrinsic curiosity.
                    Additionally, the data IVE collects enables downstream learning performance that closely
                    matches that of policies trained on human-collected demonstrations.
                </p>
            </div>
        </div>
    </section>
    <!-- Abstract -->



    <!-- Project video -->
    <section class="section">
        <div class="container is-max-desktop">
            <h3 class="title is-3 has-text-centered">Project video</h3>
            <div class="publication-video">
                <iframe src="assets/videos/ive_video_submission_final.mp4" frameborder="0" allow="encrypted-media"
                    allowfullscreen></iframe>
            </div>
        </div>
    </section>
    <!-- Project video -->


    <section class="section">
        <div class="container is-max-desktop">
            <h3 class="title is-3 has-text-centered">Overview</h3>
            <div>
                <figure class="image is-fullwidth has-text-justified">
                    <img src="./assets/images/overview.png" alt="Overview of IVE">
                    <figcaption class="has-text-justified">
                        <p>Overview of IVE (Imagine, Verify, Execute). The Scene Describer constructs a scene graph from
                            observations, the Explorer imagines novel configurations guided by memory
                            retrieval, and the Verifier predicts the physical plausibility of proposed
                            transitions. Verified plans are executed using action tools. Exploration is
                            structured around semantic reasoning, verification, and physically grounded
                            interaction.</p>
                    </figcaption>
                </figure>
            </div>
        </div>
    </section>
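    <section class="section">
        <div class="container is-max-desktop">
            <div class="content has-text-justified">
                <p>The loop described in the overview can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IVE implementation: the callables <code>describe</code>, <code>imagine</code>, <code>verify</code>, and <code>execute</code> are hypothetical stand-ins for the Scene Describer, Explorer, Verifier, and action tools, and a scene graph is assumed to be a set of (subject, relation, object) triples. The toy world in <code>toy_demo</code> exists only to make the sketch runnable.</p>
<pre>
```python
from typing import Callable, FrozenSet, List, Tuple

# Assumption: a scene graph is a frozen set of (subject, relation, object) triples.
SceneGraph = FrozenSet[Tuple[str, str, str]]
# Assumption: a plan pairs an action sequence with the desired scene graph.
Plan = Tuple[List[str], SceneGraph]


def explore(
    describe: Callable[[], SceneGraph],
    imagine: Callable[[SceneGraph, List[SceneGraph]], Plan],
    verify: Callable[[SceneGraph, Plan], bool],
    execute: Callable[[List[str]], None],
    num_rounds: int,
) -> List[SceneGraph]:
    """Describe, imagine, verify, execute; return every scene graph observed."""
    memory = [describe()]
    for _ in range(num_rounds):
        current = memory[-1]
        plan = imagine(current, memory)
        if not verify(current, plan):
            continue  # plans judged implausible are never executed
        execute(plan[0])
        memory.append(describe())
    return memory


def toy_demo() -> List[SceneGraph]:
    """Hypothetical two-block world; the first proposal is physically impossible."""
    state = {"graph": frozenset()}
    proposals = iter([
        (["stack a on a"], frozenset({("a", "Stacked On", "a")})),
        (["stack a on b"], frozenset({("a", "Stacked On", "b")})),
    ])

    def describe() -> SceneGraph:
        return state["graph"]

    def imagine(current: SceneGraph, memory: List[SceneGraph]) -> Plan:
        return next(proposals)

    def verify(current: SceneGraph, plan: Plan) -> bool:
        # Reject any desired graph in which an object relates to itself.
        return all(subj != obj for subj, _, obj in plan[1])

    def execute(actions: List[str]) -> None:
        for action in actions:
            _, top, _, bottom = action.split()
            state["graph"] = frozenset({(top, "Stacked On", bottom)})

    return explore(describe, imagine, verify, execute, num_rounds=2)


print(toy_demo())
```
</pre>
                <p>In the toy run, the first proposal is rejected by the verifier and never executed, so the returned memory holds two graphs: the empty starting scene and the scene in which <code>a</code> is stacked on <code>b</code>. Skipping a rejected plan is a simplification chosen for this sketch; a fuller loop could instead feed the rejection reason back to the proposer.</p>
            </div>
        </div>
    </section>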



    <!-- Ablation baselines: video -->
    <section class="section">
        <div class="container is-max-desktop">
            <div class="results">
                <figure class="image is-fullwidth has-text-justified">
                    <img src="assets/images/ablations_baselines.png" alt="Ablation baselines">
                    <figcaption class="has-text-justified">
                        <p>Exploring with Embodied Agents: This figure compares the exploration capabilities of our
                            method, IVE, powered by different Vision-Language Models (VLMs) against a human expert. The
                            plots show performance across four key metrics as a function of interaction: (Left) the
                            growth in the number of unique scene graphs discovered, (Middle Left) the entropy of visited
                            states (a measure of diversity), (Middle Right) empowerment (the agent's ability to
                            influence future states), and (Right) information gain (the amount of new information
                            acquired). Notably, IVE, regardless of the VLM used, surpasses the human expert in
                            generating unique scene graphs, achieving higher state diversity, and gaining more
                            information.</p>
                    </figcaption>
                </figure>
            </div>
        </div>
    </section>
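    <section class="section">
        <div class="container is-max-desktop">
            <div class="content has-text-justified">
                <p>Two of the metrics named in the caption above, the number of unique scene graphs and the entropy of visited states, can be illustrated with a short Python sketch. The exact definitions used in the evaluation are not given on this page, so this is one plausible reading: scene graphs are assumed to be sets of (subject, relation, object) triples, <code>Near</code> is assumed to be symmetric, and entropy is taken as the Shannon entropy of the empirical visitation distribution. All function names are hypothetical.</p>
<pre>
```python
import math
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]
SceneGraph = FrozenSet[Triple]


def canonical(relations: Iterable[Triple]) -> SceneGraph:
    """Sort the endpoints of 'Near' edges so that (a, b) and (b, a) coincide."""
    out = set()
    for subj, rel, obj in relations:
        if rel == "Near":  # assumption: Near is symmetric, Stacked On is not
            subj, obj = sorted((subj, obj))
        out.add((subj, rel, obj))
    return frozenset(out)


def unique_graph_curve(visited: List[SceneGraph]) -> List[int]:
    """Number of distinct scene graphs seen after each interaction."""
    seen, curve = set(), []
    for graph in visited:
        seen.add(graph)
        curve.append(len(seen))
    return curve


def visitation_entropy(visited: List[SceneGraph]) -> float:
    """Shannon entropy, in bits, of the empirical distribution over visited graphs."""
    counts = Counter(visited)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical trajectory: the first two graphs are the same scene written two ways.
visited = [
    canonical([("a", "Near", "b")]),
    canonical([("b", "Near", "a")]),
    canonical([("a", "Stacked On", "b")]),
    canonical([("c", "Near", "a")]),
]
print(unique_graph_curve(visited))
print(visitation_entropy(visited))
```
</pre>
                <p>Canonicalizing before counting matters here: without it, the same physical arrangement described in two edge orders would be counted as two discoveries and would inflate both metrics.</p>
            </div>
        </div>
    </section>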
    <!-- Ablation baselines: video -->


<section class="section">
    <div class="container is-max-desktop">
        <div class="prompt-prompts-container">
            <h3 class="title is-3 has-text-centered">Prompts</h3>
        </div>
        <br>
        <ul class="box has-background-light p-5">
            <li class="mb-3">
              <span class="icon has-text-primary">
                <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
                stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
                <rect x="9" y="9" width="13" height="13" rx="2" ry="2"></rect>
                <path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"></path>
            </svg>
              </span> 
              <strong>Scene Describer</strong> 
              <button class="button is-small is-light" onclick="toggleFactor(1)">
                <i class="fas fa-chevron-down"></i>
              </button>
              <pre id="factor-1" class="is-hidden mt-2">
## Your Task
You are an expert image analyzer tasked with identifying the **exact** placement and spatial relationships of specific objects. Your job is to generate a scene graph describing these spatial relations **solely** based on the objects’ visible positions in the image.

As an image analyzer, follow Steps 1-3 below.

---

## Step 1: Fill the Answer in QnA Section

---

## Step 2: Iterative Scene Graph Construction
1. Begin with one object.
2. Add one new object at a time, to your partial scene graph.
3. For each newly added object:
- Determine its spatial relation(s) to the objects already in the scene graph.
- **Use only** the Allowed Relations in the scene graph.
- Do not assign more than one relation for the same object pair: `(new_object, existing_object)` == `(existing_object, new_object)`.
- You may introduce multiple relations at once if the new object relates to multiple existing objects.

---

## Step 3: Final Scene Graph Output
1. **Once all objects** have been introduced and verified, compile a **complete scene graph**:
- **List all nodes** (the objects in the final scene).
- **List all verified relations** between pairs of objects, using the Allowed Relations in the scene graph.
2. **Use only** objects from the "Global Object Names."
3. Even if there are missing nodes or edges in the final scene graph (because at least one object is missing), you must still provide a complete **scratch pad** and **scene graph** with the existing relations.

---

## Scene Graph Representation

- Nodes: Objects present in the scene.
- Relations: Spatial relationships between object pairs.
- Allowed Relations in the scene graph:
    - **Stacked On**: Object A is physically resting on Object B. This requires clear direct contact—Object A is visibly supported by Object B from below.
    - **Near**: Object A is positioned close to Object B without being stacked. Use this only when the objects are almost touching.

---

## Global Object Names
`&lt;GLOBAL_OBJECTS_HERE&gt;`

---

## Output Format

Please structure your final output exactly as shown below (without the lines). **Use the precise section titles**:

```
-------------
[Step 1: Fill the Answer in QnA Section]
&lt;QNA_FOR_OBJECT_RELATION&gt;

[Step 2: Iterative Scene Graph Construction]

Iteration 1:
- Added obj_a.
- Explanation of how you confirmed its presence in the image.

Iteration 2:
- Added obj_b.
- &lt;obj_b, relation_type, obj_a&gt; or &lt;obj_a, relation_type, obj_b&gt; (include any additional relations or notes)
- Explanation of how you verified this relation.

... (continue until all objects are added and checked)

[Step 3: Final Scene Graph Output]
&lt;start_graph&gt;
Nodes: obj_a, obj_b, ...
Relations: &lt;obj_a, Near, obj_b&gt;, &lt;obj_b, Near, obj_c&gt;, &lt;obj_d, Stacked On, obj_c&gt;, ...
&lt;end_graph&gt;
-------------
```
              </pre>
            </li>
            
            <br>

            <li class="mb-3">
              <span class="icon has-text-warning">
                <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
                stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
                <circle cx="11" cy="11" r="8"></circle>
                <line x1="21" y1="21" x2="16.65" y2="16.65"></line>
            </svg>
              </span> 
              <strong>Explorer</strong> 
              <button class="button is-small is-light" onclick="toggleFactor(2)">
                <i class="fas fa-chevron-down"></i>
              </button>
              <pre id="factor-2" class="is-hidden mt-2">
## Your Task
You are an expert spatial planner. Given the Current Image, your job is to generate a sequence of actions that discover a new scene configuration—one that has not been seen before.
- In addition to the action sequence, you must provide the predicted future scene graph (desired scene graph) that results from these actions.
- You have two images taken from different camera viewpoints.
- You should provide at most `&lt;NUM_STEPS_HERE&gt;` actions.

---

## Scene Graph Representation

- Nodes: Objects present in the scene.
- Relations: Spatial relationships between object pairs.
- Allowed Relations in the scene graph:
    - **Stacked On**: Object A is physically resting on Object B. This requires clear direct contact—Object A is visibly supported by Object B from below.
    - **Near**: Object A is positioned close to Object B without being stacked. Use this only when the objects are almost touching.

---

## Global Object Names
`&lt;GLOBAL_OBJECTS_HERE&gt;`

---

&lt;ACTION_TYPES&gt;

---

## Current Scene Graph
`&lt;CURRENT_SCENE_GRAPH&gt;`

---

## Scene Graph History

Shows previously visited scene graphs most similar to your current scene.
&lt;SCENEGRAPH_HISTORY&gt;

---

## Action History
`&lt;ACTION_HISTORY&gt;`

---

## Output Format
Your output format should look exactly like the content between the `-----`. **Do not** number the actions. It’s important to wrap the action sequence between `&lt;start_action_sequence&gt;` and `&lt;end_action_sequence&gt;`. Also, write down the predicted future scene graph (desired scene graph - the final arrangement after all actions) between `&lt;start_graph&gt;` and `&lt;end_graph&gt;`.

-----
&lt;start_scratch_pad&gt;
Explain your reasoning:
- Why this is a novel scene
- Why the action sequence makes sense
- If there were oddities or contradictions in the histories, how did you account for possible collisions, suction errors, or clutter?
&lt;end_scratch_pad&gt;

Predict (Desired) Future Scene Graph:
&lt;start_desired_scene_graph&gt;
Nodes: obj_a, obj_b, ...
Relations: &lt;obj_a, Near, obj_b&gt;, &lt;obj_b, Near, obj_c&gt;, &lt;obj_d, Stacked On, obj_c&gt;, ...
&lt;end_desired_scene_graph&gt;

Next Action Sequence:
&lt;start_action_sequence&gt;
&lt;ACTION_SEQUENCE_EXAMPLE&gt;
&lt;end_action_sequence&gt;
-----

### Important Considerations

1. Order Matters: Plan your actions so that preconditions are satisfied before you move an object.
2. Scene Boundaries: If an object is near the scene boundary, avoid pushing it further toward the edge or placing new objects in a risky position.
3. Manipulation (Suction) Constraints:
    - The suction can only reliably pick the topmost exposed surface.
    - In cluttered areas, an attempt to move one object may cause unintended collisions or shifts in neighboring objects.
    - Stacking another object on top of an unstable object can lead to the object toppling over.
4. Note: The list of allowed relations in Action Types and the relations used in Scene Graph Representation ([Stacked On, Near]) may differ. Desired Scene Graph should use relations among &lt;SCENEGRAPH_RELATIONS&gt; only, same as other Scene Graphs. Please keep this in mind when planning your actions.
              </pre>
            </li>
            
            <br>

            <li class="mb-3">
              <span class="icon has-text-danger">
                <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
                stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
                <polyline points="9 11 12 14 22 4"></polyline>
                <path d="M21 12v7a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V5a2 2 0 0 1 2-2h11"></path>
            </svg>
              </span> 
              <strong>Verifier</strong> 
              <button class="button is-small is-light" onclick="toggleFactor(3)">
                <i class="fas fa-chevron-down"></i>
              </button>
              <pre id="factor-3" class="is-hidden mt-2">
## Your Task
You are a spatial reasoning expert responsible for **verifying action plans** in physically dynamic environments.
You ensure that a proposed sequence of actions logically leads from the current state to the desired scene graph, without triggering unintended outcomes.

You may also provide **targeted suggestions** or, in rare but necessary cases, recommend a **temporary shift to a decluttering strategy**.

---

## Goals

Given the current image (from two camera views), transition history, desired scene graph, and a proposed action sequence:

1. **Simulate** the effect of the action sequence from the current scene
2. **Predict** the resulting scene graph
3. **Compare** the predicted graph with the desired one
4. **Evaluate physical feasibility and execution stability**
5. **Provide a judgment**:
    - Valid and feasible
    - Invalid (with reason)
    - Valid but risky (suggest a targeted fix)
    - Too unstable to proceed (recommend declutter mode)

---

&lt;ACTION_TYPES&gt;

---

## Transition History
A sequence of alternating scene graphs and actions showing the environment's evolution.

`&lt;TRANSITION_HISTORY&gt;`

---

## Output Format

```
-----
&lt;start_scratch_pad&gt;
Step-by-step analysis:
- Simulate and predict the resulting scene graph.

Scene Stability Check:
- Are any objects in clearly unstable or unreachable positions?
- Do previous transitions indicate failures or ambiguous changes?
- Are cluttered zones, deep stacks, or occlusions affecting safety or reliability?

Decision:
- Is the action sequence logically valid and does it produce the desired scene graph?
    → YES or NO

If NO:
- Explain which actions fail and why.
- Point out mismatches or invalid transitions.

If YES but issues are detected:
- Identify objects or areas causing risk (e.g., unstable stacks, blocked objects).
- Suggest fine-grained intervention (e.g., "move obj_A before continuing").

If the environment is severely cluttered and unsafe:
- Recommend a temporary shift to a decluttering mode

&lt;end_scratch_pad&gt;

&lt;start_decision&gt;
YES or NO
&lt;end_decision&gt;

&lt;start_reason&gt;
[If NO: Brief but clear explanation of what failed or was mismatched]
[If YES but risky: Warning message with suggestion, e.g., "Unstable stack: move obj_b before continuing"]
[If YES but too unstable: "Scene too cluttered. Recommend temporary declutter mode."]
[If YES and no issues: Leave this part empty]
&lt;end_reason&gt;
-----
```

---

## Scene Stability Considerations
Clutter or instability **does not always require full decluttering**. Consider recommending targeted fixes first.

#### Examples of Minor Intervention:
- `"obj_b is stacked on obj_a, which is already supporting obj_c. Recommend moving obj_b first to prevent instability."`
- `"obj_d is partially occluded and may be hard to suction. Recommend shifting nearby obj_e first."`

#### Examples of Decluttering (rare):
- `"Multiple overlapping clusters and deep stacks suggest high instability. Recommend decluttering of current layout before further scene exploration."`

              </pre>
            </li>
          </ul>

        <script>
            function toggleFactor(factor) {
                const element = document.getElementById(`factor-${factor}`);
                element.classList.toggle("is-hidden");
            }
        </script>
    </div>

    
</section>
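    <section class="section">
        <div class="container is-max-desktop">
            <div class="content has-text-justified">
                <p>The Explorer prompt is supplied with previously visited scene graphs that are most similar to the current scene. The similarity measure is not specified on this page, so the following Python sketch simply assumes Jaccard similarity over relation triples; <code>jaccard</code> and <code>retrieve_similar</code> are hypothetical names used for illustration only.</p>
<pre>
```python
from typing import FrozenSet, List, Tuple

# Assumption: a scene graph is a frozen set of (subject, relation, object) triples.
SceneGraph = FrozenSet[Tuple[str, str, str]]


def jaccard(a: SceneGraph, b: SceneGraph) -> float:
    """Jaccard similarity of two relation sets; two empty graphs count as identical."""
    union = a | b
    return len(a.intersection(b)) / len(union) if union else 1.0


def retrieve_similar(
    current: SceneGraph, history: List[SceneGraph], k: int
) -> List[SceneGraph]:
    """Return the k graphs from history most similar to current, best first."""
    return sorted(history, key=lambda g: jaccard(current, g), reverse=True)[:k]


# Hypothetical history of three visited graphs.
current = frozenset({("a", "Near", "b"), ("c", "Stacked On", "a")})
partial = frozenset({("a", "Near", "b")})
unrelated = frozenset({("c", "Stacked On", "b")})
history = [partial, unrelated, current]
print(retrieve_similar(current, history, k=2))
```
</pre>
                <p>With this history, the identical graph ranks first and the graph sharing one of the two relations ranks second. Showing the proposer its nearest neighbours in memory is one way to make "not seen before" concrete, since the closest matches are the ones a new plan most needs to differ from.</p>
            </div>
        </div>
    </section>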

    <!-- BibTeX -->
    <section class="section" id="BibTeX">
        <div class="container is-max-desktop content">
            <h3 class="title">BibTeX</h3>
            <pre><code>@article{anonymous2025ive,
  author    = {Anonymous},
  title     = {Imagine, Verify, Execute: Agentic Exploration with Vision-Language Models},
  journal   = {Under review},
  year      = {2025}
}</code></pre>
        </div>
    </section>
    <!-- BibTeX -->



</body>

</html>