<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta
      name="description"
      content="VLASim: World Modelling via VLM-Directed Abstraction and Simulation"
    />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>
      VLASim: World Modelling via VLM-Directed Abstraction and Simulation
    </title>

    <link
      href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
      rel="stylesheet"
    />

    <link rel="stylesheet" href="./static/css/bulma.min.css" />
    <link rel="stylesheet" href="./static/css/bulma-carousel.min.css" />
    <link rel="stylesheet" href="./static/css/bulma-slider.min.css" />
    <link rel="stylesheet" href="./static/css/fontawesome.all.min.css" />
    <link
      rel="stylesheet"
      href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"
    />
    <link rel="stylesheet" href="./static/css/index.css" />

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
    <script defer src="./static/js/fontawesome.all.min.js"></script>
    <script src="./static/js/bulma-carousel.min.js"></script>
    <script src="./static/js/bulma-slider.min.js"></script>
    <script src="./static/js/index.js"></script>
    <style>
      .video-margin {
        margin-bottom: 20px;
      }
    </style>
    <style>
      .reduce-space {
        margin-bottom: -100px;
      }
    </style>
  </head>
  <body>
    <section class="hero">
      <div class="reduce-space">
        <div class="hero-body">
          <div class="container is-max-desktop">
            <div class="columns is-centered">
              <div class="column has-text-centered">
                <h1 class="title is-1 publication-title">
                  <span class="main-title">VLASim:</span><br /><span
                    class="sub-title"
                    >World Modelling via VLM-Directed Abstraction and
                    Simulation</span
                  >
                </h1>
              </div>
            </div>
          </div>
        </div>
      </div>
    </section>

    <section class="section">
      <div class="container is-max-desktop">
        <!-- Abstract. -->
        <div class="rows is-centered">
          <figure>
            <video id="teaser" autoplay muted loop playsinline width="100%">
              <source src="videos/teaser.mp4" type="video/mp4" />
            </video>
          </figure>
          <div class="content has-text-justified">
            Image caption used as input: "A row of colorful wooden blocks lined
            up on a wooden table with wooden stick attached to a black rotating
            platform. The platform rotates clockwise and the wooden stick hits
            the first block as it rotates. Static shot with no camera movement."
            VLASim generates a scene abstraction and simulates it with a
            simulator chosen by the VLM, producing a physically accurate and
            temporally coherent video. The generated abstract scene
            representation is interpretable and controllable. We show several
            examples of user interventions, such as changing camera positions
            and adding new objects into the scene.
          </div>
          <div class="content has-text-justified">
            <p>
              <br /><br />
              <b>Abstract</b>: Generative video models, a leading approach to
              world modeling, face fundamental limitations. They often violate
              physical and logical rules, lack interactivity, and operate as
              opaque black boxes ill-suited for building structured, queryable
              worlds. To overcome these challenges, we propose a new paradigm
              focused on distilling an image-caption pair into a tractable,
              abstract representation optimized for simulation. We introduce
              VLASim, a framework where a Vision-Language Model (VLM) acts as an
              intelligent agent to orchestrate this process. The VLM
              autonomously constructs a grounded (2D or 3D) scene representation
              by selecting from a suite of vision tools, and co-dependently
              chooses a compatible physics simulator (e.g., rigid body, fluid)
              to act upon it. Furthermore, VLASim can infer latent dynamics from
              the static scene to predict plausible future states. Our
              experiments show that this combination of intelligent abstraction
              and adaptive simulation results in a versatile world model capable
              of producing higher-quality simulations across a wider range of
              dynamic scenarios than prior approaches.
            </p>
          </div>

          <br />
        </div>
        <!--/ Abstract. -->
      </div>
      <div class="container is-max-desktop">
        <div class="rows is-centered">
          <div class="row">
            <h2 class="title is-3 has-text-centered" style="color: red">
              Fine-Grained Control
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/robot_a.mp4" type="video/mp4" />
            </video>
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/robot_b.mp4" type="video/mp4" />
            </video>
            <h2 class="" style="color: red">
              In this scene, we show fine-grained control over a robot arm
              moving blocks. The abstraction is produced by VLASim, and the user
              can direct the robot arm to desired positions. We show two
              different scenes from the Language Table Dataset, with two control
              sequences per scene.
            </h2>
          </div>
          <br />
          <div class="row">
            <h2 class="title is-3 has-text-centered" style="color: red">
              Further Comparisons with Veo3
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/complex.mp4" type="video/mp4" />
            </video>
            <h2 class="" style="color: red">
              In the Veo3 results, objects move in implausible ways. The double
              pendulum disconnects, and a third bob appears and disappears. In
              the whiteboard scene, the core mechanic is not faithfully
              reproduced, and two additional blocks appear spontaneously. In the
              cluttered scene, the ball bounces implausibly, the tennis ball
              tube and pink box move unrealistically, and the roll of tape jumps
              onto the top of the box. By contrast, VLASim produces physically
              accurate and temporally coherent results.
            </h2>
          </div>
          <br />

          <div class="row">
            <h2 class="title is-3 has-text-centered" style="color: red">
              Intervention Experiments
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source
                src="videos/intervention_liquid_on_duck.mp4"
                type="video/mp4"
              />
            </video>
            <h2 class="" style="color: red">
              Interventions on the liquid-on-duck scene. First, we swap the duck
              mesh produced by VLASim with an external asset: the Stanford
              Bunny. Second, we reduce the flow rate of the liquid.
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source
                src="videos/intervention_ball_hits_duck.mp4"
                type="video/mp4"
              />
            </video>
            <h2 class="" style="color: red">
              Interventions on the ball-hits-duck scene. First, we reduce the
              mass of the duck. Second, we change the direction of gravity so
              that the objects fall upwards.
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/intervention_conway.mp4" type="video/mp4" />
            </video>
            <h2 class="" style="color: red">
              Interventions on Conway's Game of Life. First, we invert the
              appearance of the game so that dead cells are illustrated with
              flowers. Second, we change the rules of the game so that a live
              cell survives if it has 1, 2 or 3 live neighbours. Note that this
              second intervention takes the simulation outside the training
              distribution of the LLM, showing that interventions can still
              reach outcomes that are out of distribution for the LLM.
            </h2>
          </div>
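          <div class="row">
            <div class="content">
              <p>
                The rule change described above can be written down in a few
                lines. The following is a minimal Python sketch, not the code
                generated by VLASim: the function name
                <code>life_step</code> and its <code>survive</code> and
                <code>birth</code> parameters are hypothetical, and cells
                outside the grid are assumed to be dead, as in the zero
                boundary condition used for the Game of Life comparison on
                this page.
              </p>
<pre>
```python
def life_step(grid, survive=(2, 3), birth=(3,)):
    """Advance a Game of Life grid by one step.

    grid is a list of rows holding 0 (dead) or 1 (live).
    survive and birth are hypothetical parameters listing the live-neighbour
    counts that keep a live cell alive or bring a dead cell to life.
    Cells outside the grid are treated as dead (zero boundary).
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live cells among the eight neighbours that lie on the grid.
            live = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if rr in range(rows) and cc in range(cols):
                        live += grid[rr][cc]
            allowed = survive if grid[r][c] else birth
            nxt[r][c] = 1 if live in allowed else 0
    return nxt
```
</pre>
              <p>
                Calling <code>life_step(grid)</code> applies the standard
                rules, while <code>life_step(grid, survive=(1, 2, 3))</code>
                applies the modified survival rule. Under the modified rule, a
                pair of adjacent live cells persists, whereas under the
                standard rules it dies out after one step.
              </p>
            </div>
          </div>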
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source
                src="videos/intervention_reaction_diffusion.mp4"
                type="video/mp4"
              />
            </video>
            <h2 class="" style="color: red">
              Interventions on the reaction-diffusion scene. In both
              interventions we change the feed and kill rates of the second
              chemical, resulting in different patterns forming over time.
            </h2>
          </div>
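          <div class="row">
            <div class="content">
              <p>
                To illustrate why feed and kill rates matter, the sketch below
                implements one explicit update step of a two-chemical
                reaction-diffusion system. It assumes the Gray-Scott model,
                which this page does not name, and is not the simulator
                produced by VLASim. The function names, the diffusion
                coefficients, the time step, the periodic boundaries and the
                seeding pattern are all illustrative assumptions.
              </p>
<pre>
```python
def laplacian(a, i, j):
    """Five-point Laplacian with periodic (wrap-around) boundaries."""
    n, m = len(a), len(a[0])
    return (a[(i - 1) % n][j] + a[(i + 1) % n][j]
            + a[i][(j - 1) % m] + a[i][(j + 1) % m] - 4.0 * a[i][j])


def gray_scott_step(u, v, feed, kill, du=0.16, dv=0.08, dt=1.0):
    """One explicit Euler step of the (assumed) Gray-Scott equations.

    u and v are 2D lists of concentrations. feed replenishes u and,
    together with kill, removes v. du, dv and dt are illustrative values.
    """
    n, m = len(u), len(u[0])
    new_u = [[0.0] * m for _ in range(n)]
    new_v = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            reaction = u[i][j] * v[i][j] * v[i][j]  # u + 2v -> 3v
            new_u[i][j] = u[i][j] + dt * (
                du * laplacian(u, i, j) - reaction + feed * (1.0 - u[i][j]))
            new_v[i][j] = v[i][j] + dt * (
                dv * laplacian(v, i, j) + reaction - (feed + kill) * v[i][j])
    return new_u, new_v


def run(feed, kill, steps=50, size=16):
    """Seed a small square of v in the centre and integrate for some steps."""
    u = [[1.0] * size for _ in range(size)]
    v = [[0.0] * size for _ in range(size)]
    for i in range(size // 2 - 2, size // 2 + 2):
        for j in range(size // 2 - 2, size // 2 + 2):
            u[i][j], v[i][j] = 0.5, 0.25
    for _ in range(steps):
        u, v = gray_scott_step(u, v, feed, kill)
    return u, v
```
</pre>
              <p>
                Calling <code>run</code> with two different
                <code>(feed, kill)</code> pairs from the same seed yields
                different concentration fields, which is the kind of
                intervention shown in the video.
              </p>
            </div>
          </div>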
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source
                src="videos/intervention_block_domino.mp4"
                type="video/mp4"
              />
            </video>
            <h2 class="" style="color: red">
              Interventions on the block domino scene. First, we add another
              block to the row. Second, we interrupt the cascade by firing
              particles at the block row.
            </h2>
          </div>
          <br />

          <div class="row">
            <h2 class="title is-3 has-text-centered" style="color: red">
              Variability in Simulations
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/variability.mp4" type="video/mp4" />
            </video>
            <h2 class="" style="color: red">
              Generating a simulation from a single image and caption is
              inherently under-constrained. In our method, the use of an LLM
              produces diverse simulation outcomes, and the variability matches
              the uncertainty in the input. Here, we show multiple simulations
              generated from the same input image and caption. In the first
              example, the scene is relatively constrained, so the simulations
              are similar to each other, differing only in the final positions
              of the blocks. In the second example, the scene is more ambiguous,
              and the resulting simulations differ in ball speed, restitution,
              and surface friction.
            </h2>
          </div>
          <br />
        </div>

        <div class="row">
          <h2 class="title is-3 has-text-centered" style="color: red">
            Further Results
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/complex_1.mp4" type="video/mp4" />
          </video>
          <h2 class="" style="color: red">
            Additional results showing scenes with physical interaction. In the
            first scene, we drop a tennis ball into a cluttered arrangement of
            objects. In the second, the caption asks for the mouse to be moved
            to wake the computer. Importantly, VLASim only segments and modifies
            the components relevant to this action, leaving the rest of the
            scene unchanged. In the third example, we throw a ball into a scene
            containing various cluttered objects.
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/complex_2.mp4" type="video/mp4" />
          </video>
          <h2 class="" style="color: red">
            Additional results showing scenes containing physical abstraction.
            In the first example, from the Aerial Traffic Dataset, the motion of
            a bus is reduced to a 2D simulation, prompted with the caption "a
            bus turning right at an intersection". In the second example, a game
            of Tetris is abstracted from a simple drawing of the game. In the
            final example, a modified sample from the PhysGen Dataset, a complex
            scene of a pendulum freely swinging while attached to an
            accelerating car is correctly modelled. Note that as the car
            accelerates, the pendulum moves backwards in the car's frame of
            reference.
          </h2>
        </div>
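        <div class="row">
          <div class="content">
            <p>
              The backward swing in the pendulum example follows from working
              in the car's accelerating frame, where the bob feels a
              pseudo-force opposite to the car's acceleration. The Python
              sketch below illustrates this under stated assumptions: a point
              mass on a rigid rod, an angle measured from the downward vertical
              and positive towards the front of the car, and illustrative
              values for the rod length, damping and time step. The function
              names are hypothetical, and this is not the simulator selected by
              VLASim.
            </p>
<pre>
```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def angular_accel(theta, car_accel, length=1.0):
    """Angular acceleration of the bob in the car's (non-inertial) frame.

    theta is measured from the downward vertical and is positive towards
    the front of the car. The second term is the pseudo-force contribution.
    """
    return (-(G / length) * math.sin(theta)
            - (car_accel / length) * math.cos(theta))


def rest_angle(car_accel):
    """Angle at which gravity and the pseudo-force balance."""
    return -math.atan2(car_accel, G)


def simulate(car_accel, steps=2000, dt=0.001, damping=0.5, length=1.0):
    """Integrate with semi-implicit Euler and return the angle history.

    The pendulum starts at rest, hanging straight down. damping is an
    illustrative viscous term so that the motion settles over time.
    """
    theta, omega = 0.0, 0.0
    history = [theta]
    for _ in range(steps):
        accel = angular_accel(theta, car_accel, length) - damping * omega
        omega += dt * accel
        theta += dt * omega
        history.append(theta)
    return history
```
</pre>
            <p>
              For a positive <code>car_accel</code>, the angle returned by
              <code>simulate</code> becomes negative, meaning the bob swings
              towards the rear of the car, and with damping it settles near
              <code>rest_angle(car_accel)</code>.
            </p>
          </div>
        </div>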
        <br />

        <div class="row">
          <h2 class="title is-3 has-text-centered" style="color: red">
            Coarse Control
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/book_stack.mp4" type="video/mp4" />
          </video>
          <h2 class="" style="color: red">
            In these examples we demonstrate coarse control. By only changing
            the caption passed to the model, we can modify the physical dynamics
            of the scene. In the first example, we provide a caption indicating
            that a second tennis ball falls from above the stack, resulting in a
            collision. In the second example, we provide a caption indicating
            that the camera pans to an overhead view of the stack.
          </h2>
        </div>
        <br />

        <div class="row">
          <h2 class="title is-3 has-text-centered" style="color: red">
            Using Visual Context
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source
              src="videos/visual_context_diffusion.mp4"
              type="video/mp4"
            />
          </video>
          <h2 class="" style="color: red">
            Both of these simulations were given the same caption, "A diffusion
            process", but different initial images. The model uses the visual
            context to select appropriate simulators, with the first example
            simulating the Brownian motion of particles, and the second example
            simulating the softening of a concentration gradient.
          </h2>
        </div>
        <br />

        <div class="row">
          <h2 class="title is-3 has-text-centered">Comparisons with Wan2.2</h2>
        </div>
        <br />

        <div class="row">
          <h2 class="">
            VLASim generates a scene abstraction and simulates it with a
            simulator chosen by the VLM, while Wan2.2 and Veo3 directly generate
            videos. VLASim produces more physically accurate and temporally
            coherent results than these state-of-the-art methods. We evaluate
            Veo3 on only a few examples because of the associated inference
            costs. Even so, these examples clearly show a lack of physical
            plausibility.
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Wan/0.mp4" type="video/mp4" />
          </video>
          <h2 class="">
            In the Wan2.2 result, the duck moves implausibly to the left, and
            the ball implausibly moves back to the left after the collision.
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Wan/2.mp4" type="video/mp4" />
          </video>
          <h2 class="">
            In the Wan2.2 result, the number of domino blocks changes over time.
            Additionally, the gap between the blocks does not stop the falling
            motion from propagating. Finally, the stick on the turntable falls
            off implausibly at the end.
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Wan/3.mp4" type="video/mp4" />
          </video>
          <h2 class="">
            In the Wan2.2 result, the ball implausibly jumps off the turntable
            and then jumps back on.
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Wan/4.mp4" type="video/mp4" />
          </video>
          <h2 class="">
            In the Wan2.2 result, an extra pink block appears implausibly.
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Wan/5.mp4" type="video/mp4" />
          </video>
          <h2 class="">
            In the Wan2.2 result, several additional balls appear implausibly.
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Wan/6.mp4" type="video/mp4" />
          </video>
          <h2 class="">
            In the Wan2.2 result, the unstable stack of blocks does not fall even
            when the blue block pushes on the yellow block.
          </h2>
        </div>
        <br />

        <div class="row">
          <h2 class="">
            We also evaluate on Conway's Game of Life, a cellular automaton with
            simple rules. VLASim correctly simulates the dynamics, while Wan2.2
            fails to do so. The caption provided to both methods is "Conway's
            game of life on a 16 by 9 grid. Each frame constitutes one step of
            the game. The boundary condition is zero (pixels outside the grid
            are dead)."
          </h2>
        </div>
        <br />

        <div class="row">
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Conway/0.mp4" type="video/mp4" />
          </video>
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Conway/1.mp4" type="video/mp4" />
          </video>
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Conway/2.mp4" type="video/mp4" />
          </video>
          <video
            id="main"
            class="video-margin"
            autoplay
            muted
            loop
            playsinline
            width="100%"
          >
            <source src="videos/Conway/3.mp4" type="video/mp4" />
          </video>
        </div>
        <div class="rows is-centered">
          <div class="row">
            <h2 class="title is-3 has-text-centered">Comparisons with Veo3</h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/Veo/0.mp4" type="video/mp4" />
            </video>
            <h2 class="">
              In the Veo3 result, the probe changes and a metal end appears.
              Input caption: "A grabber tool carefully placing a blue wooden
              block on top of a yellow block which is balanced on a red block
              forming an L shape. Static shot with no camera movement."
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/Veo/1.mp4" type="video/mp4" />
            </video>
            <h2 class="">
              In the Veo3 result, the number of domino blocks changes over time.
              Additionally, the color and shape of blocks change over time.
              Finally, the gap does not stop the falling motion.
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/Veo/c0.mp4" type="video/mp4" />
            </video>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/Veo/c1.mp4" type="video/mp4" />
            </video>
            <h2 class="">
              Like Wan2.2, Veo3 also fails to simulate Conway's Game of Life
              correctly.
            </h2>
          </div>
        </div>
      </div>
      <br />

      <div class="container is-max-desktop">
        <div class="rows is-centered">
          <div class="row">
            <h2 class="title is-3 has-text-centered">
              More Results and Ground Truth Visualisation
            </h2>
          </div>
          <br />

          <div class="row">
            <h2 class="">
              We show several results, along with the ground truth videos. Our
              method produces physically accurate results. Note that there are
              several valid futures for each scene, so our results do not
              exactly match the ground truth. We want to emphasise that our
              results show the correct physical interactions and dynamics, which
              is the main goal of our work.
            </h2>
          </div>
          <br />

          <div class="row">
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/results_0.mp4" type="video/mp4" />
            </video>
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/results_1.mp4" type="video/mp4" />
            </video>
            <video
              id="main"
              class="video-margin"
              autoplay
              muted
              loop
              playsinline
              width="100%"
            >
              <source src="videos/results_2.mp4" type="video/mp4" />
            </video>
          </div>
        </div>
      </div>
    </section>


    <footer class="footer">
      <div class="container">
        <div class="columns is-centered">
          <div class="column is-8">
            <div class="content has-text-centered">
              <p>
                Website source based on
                <a href="https://github.com/nerfies/nerfies.github.io"
                  >this source code</a
                >.
              </p>
            </div>
          </div>
        </div>
      </div>
    </footer>
  </body>
</html>
