VLASim:
World Modelling via VLM-Directed Abstraction and Simulation from a Single Image

Image caption used as input: "A row of colorful wooden blocks lined up on a wooden table with wooden stick attached to a black rotating platform. The platform rotates clockwise and the wooden stick hits the first block as it rotates. Static shot with no camera movement." VLASim generates a scene abstraction and simulates it with a simulator chosen by the VLM, producing a physically accurate and temporally coherent video. The generated abstract scene representation is interpretable and controllable. We show several examples of user interventions, such as changing the camera position and adding new objects to the scene.



Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling a single image into a tractable, abstract representation optimized for simulation. We introduce VLASim, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and jointly selects a compatible physics simulator (e.g., rigid body, fluid) to act upon it. Furthermore, VLASim can infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing higher-quality simulations across a wider range of dynamic scenarios than prior approaches.


Comparisons with Wan2.2


VLASim generates a scene abstraction and simulates it with a simulator chosen by the VLM, whereas Wan2.2 and Veo3 generate videos directly. VLASim produces more physically accurate and temporally coherent results than these state-of-the-art methods. Because Veo3 inference is costly, we evaluate it on only a few examples. Even so, these examples clearly show a lack of physical plausibility.


In the Wan2.2 result, the duck moves implausibly to the left, and the ball implausibly moves back to the left after the collision.

In the Wan2.2 result, the number of domino blocks changes over time. Additionally, the gap between the blocks does not stop the falling motion from propagating. Finally, the stick on the turntable falls off implausibly at the end.

In the Wan2.2 result, the ball implausibly jumps off the turntable and then jumps back on.

In the Wan2.2 result, an extra pink block appears implausibly.

In the Wan2.2 result, several additional balls appear implausibly.

In the Wan2.2 result, the unstable stack of blocks does not fall even when the blue block pushes on the yellow block.

We also evaluate on Conway's Game of Life, a cellular automaton with simple rules. VLASim correctly simulates the dynamics, while Wan2.2 fails to do so. The caption provided to both methods is "Conway's game of life on a 16 by 9 grid. Each frame constitutes one step of the game. The boundary condition is zero (pixels outside the grid are dead)."
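For reference, the rules described in the caption can be sketched in a few lines of Python. This is only a minimal illustration of Conway's Game of Life with a zero (dead) boundary, not the simulator VLASim uses. The names `life_step` and `live_cells` are hypothetical, and reading "16 by 9" as 16 columns by 9 rows is an assumption.

```python
# Minimal sketch of one Game of Life step with a zero (dead) boundary.
# Assumption: "16 by 9" means 16 columns (width) by 9 rows (height).
WIDTH, HEIGHT = 16, 9


def life_step(grid):
    """Return the next generation; cells outside the grid count as dead."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live neighbours among the 8 surrounding cells,
            # skipping any position that falls outside the grid.
            n = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        n += grid[rr][cc]
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            alive = grid[r][c] == 1
            nxt[r][c] = 1 if n == 3 or (alive and n == 2) else 0
    return nxt


def live_cells(grid):
    """List (row, col) coordinates of live cells in row-major order."""
    return [(r, c) for r, row in enumerate(grid)
            for c, v in enumerate(row) if v]


# Usage: a horizontal "blinker" flips to vertical after one step.
grid = [[0] * WIDTH for _ in range(HEIGHT)]
for c in (1, 2, 3):
    grid[1][c] = 1
print(live_cells(life_step(grid)))
```

Because each frame is one step of these deterministic rules, any single cell that deviates from the output of such a step function marks a frame as incorrect, which makes this setting a strict test of rule-following.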


Comparisons with Veo3

In the Veo3 result, the probe changes and a metal end appears. Input caption: "A grabber tool carefully placing a blue wooden block on top of a yellow block which is balanced on a red block forming an L shape. Static shot with no camera movement."

In the Veo3 result, the number of domino blocks changes over time. Additionally, the color and shape of blocks change over time. Finally, the gap does not stop the falling motion.



Like Wan2.2, Veo3 also fails to simulate Conway's Game of Life correctly.



More Results and Ground Truth Visualisation


We show several results alongside the ground truth videos. Our method produces physically accurate results. Note that each scene admits several valid futures, so our results do not exactly match the ground truth. We emphasise that our results show correct physical interactions and dynamics, which is the main goal of our work.