VLASim:
World Modelling via VLM-Directed Abstraction and Simulation

Image caption used as input: "A row of colorful wooden blocks lined up on a wooden table with wooden stick attached to a black rotating platform. The platform rotates clockwise and the wooden stick hits the first block as it rotates. Static shot with no camera movement." VLASim generates a scene abstraction and simulates it with a simulator chosen by the VLM, producing a physically accurate and temporally coherent video. The generated abstract scene representation is interpretable and controllable. We show several examples of user interventions, such as changing camera positions and adding new objects into the scene.



Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VLASim, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and co-dependently chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. Furthermore, VLASim can infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing higher-quality simulations across a wider range of dynamic scenarios than prior approaches.


Fine-Grained Control


In this scene, we show fine-grained control over a robot arm moving blocks. The abstraction is produced by VLASim, and the user can direct the robot arm to desired positions. We show two different scenes from the Language Table Dataset, with two control sequences per scene.


Further Comparisons with Veo3


In the Veo3 results, objects move in implausible ways. The double pendulum disconnects, and a third bob appears and disappears. In the whiteboard scene, the core mechanic is not faithfully reproduced, and two additional blocks appear spontaneously. In the cluttered scene, the ball bounces implausibly, the tennis ball tube and pink box move unrealistically, and the roll of tape jumps onto the top of the box. By contrast, VLASim produces physically accurate and temporally coherent results.


Intervention Experiments


Interventions on the liquid-on-duck scene. First, we swap the duck mesh produced by VLASim with an external asset, the Stanford Bunny. Second, we reduce the flow rate of the liquid.


Interventions on the ball-hits-duck scene. First, we reduce the mass of the duck. Second, we change the direction of gravity so that the objects fall upwards.


Interventions on Conway's Game of Life. First, we invert the appearance of the game so that dead cells are illustrated with flowers. Second, we change the rules of the game so that a cell survives if it has 1, 2, or 3 neighbours. Note that this second intervention moves the simulation outside the distribution of the LLM's training data, showing that interventions can still achieve a desired outcome even when it lies out of distribution for the LLM.
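To make the rule change concrete, here is a minimal NumPy sketch of a Life-like step with a configurable survival rule and a zero (dead) boundary. The function name and parameter defaults are illustrative, not the paper's implementation:

```python
import numpy as np

def life_step(grid, survive=(2, 3), born=(3,)):
    """One step of a Life-like cellular automaton with a zero (dead) boundary.

    `survive` lists the neighbour counts that keep a live cell alive, and
    `born` the counts that bring a dead cell to life. Standard Conway rules
    are survive=(2, 3), born=(3,); the intervention described above
    corresponds to survive=(1, 2, 3).
    """
    h, w = grid.shape
    padded = np.pad(grid, 1)  # zero boundary: cells outside the grid are dead
    # Sum the 8 shifted views of the padded grid to count live neighbours.
    neighbours = sum(
        padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    alive = grid.astype(bool)
    next_alive = (alive & np.isin(neighbours, survive)) | \
                 (~alive & np.isin(neighbours, born))
    return next_alive.astype(grid.dtype)
```

Under the standard rules a vertical blinker flips to horizontal each step; under survive=(1, 2, 3) the same blinker keeps all three cells and gains births on either side.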


Interventions on the reaction-diffusion scene. In both interventions, we change the feed and kill rates of the second chemical, resulting in different patterns forming over time.
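For context, reaction-diffusion systems of this kind are commonly modelled with the Gray-Scott equations, in which a feed rate F replenishes the first chemical and a kill rate k removes the second; different (F, k) pairs produce spots, stripes, and other patterns. A minimal explicit-Euler sketch, assuming the Gray-Scott model (the paper's exact simulator and parameter names may differ):

```python
import numpy as np

def gray_scott_step(U, V, Du=0.16, Dv=0.08, F=0.035, k=0.065, dt=1.0):
    """One explicit Euler step of the Gray-Scott reaction-diffusion model.

    U is the first chemical, V the second. F (feed) replenishes U,
    k (kill) removes V, and Du/Dv are the diffusion coefficients.
    """
    def laplacian(Z):
        # 5-point stencil with periodic boundaries.
        return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0)
                + np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

    UVV = U * V * V  # reaction term: U + 2V -> 3V
    U = U + dt * (Du * laplacian(U) - UVV + F * (1 - U))
    V = V + dt * (Dv * laplacian(V) + UVV - (F + k) * V)
    return U, V
```

Seeding V in a small patch of an otherwise U-filled grid and iterating this step is enough to watch patterns emerge, and sweeping F and k reproduces the kind of intervention shown above.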


Interventions on the block domino scene. First, we add another block to the row. Second, we interrupt the cascade by firing particles at the block row.


Variability in Simulations


Generating a simulation from a single image and caption is inherently under-constrained. Our use of an LLM produces diverse simulation outcomes, and the variability matches the uncertainty in the input. Here, we show multiple simulations generated from the same input image and caption. In the first example, the scene is relatively constrained, so the simulations are similar to each other, differing only in the final positions of the blocks. In the second example, the scene is more ambiguous, and the resulting simulations differ in ball speed, restitution, and surface friction.


Further Results


Additional results showing scenes with physical interaction. In the first scene, we drop a tennis ball into a cluttered arrangement of objects. In the second, we set the caption to "move the mouse to wake the computer". Importantly, VLASim only segments and modifies the components relevant to this action, leaving the rest of the scene unchanged. In the third example, we throw a ball into a scene with various cluttered objects.


Additional results showing scenes containing physical abstraction. In the first example, from the Aerial Traffic Dataset, the motion of a bus is reduced to a 2D simulation, prompted with the caption "a bus turning right at an intersection". In the second example, a game of Tetris is abstracted from a simple drawing of the game. In the final example, a modified sample from the PhysGen Dataset, a complex scene of a pendulum freely swinging while attached to an accelerating car is correctly modelled. Note that as the car accelerates, the pendulum moves backwards in the car's frame of reference.
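The pendulum's backward swing follows from analysing it in the car's non-inertial frame, where the acceleration introduces a pseudo-force. In our notation (not taken from the paper), with pendulum length L, car acceleration a, and angle θ measured from the downward vertical, positive in the direction of travel, the equation of motion in the car's frame is

```latex
\ddot{\theta} \;=\; -\frac{g}{L}\sin\theta \;-\; \frac{a}{L}\cos\theta ,
```

so for constant a the equilibrium satisfies tan θ* = −a/g: the pendulum settles behind the vertical, opposite the direction of acceleration, exactly as seen in the result.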


Coarse Control


In these examples we demonstrate coarse control. By only changing the caption passed to the model, we can modify the physical dynamics of the scene. In the first example, we provide a caption indicating that a second tennis ball falls from above the stack, resulting in a collision. In the second example, we provide a caption indicating that the camera pans to an overhead view of the stack.


Using Visual Context


Both of these simulations were given the same caption, "A diffusion process", but different initial images. The model uses the visual context to select appropriate simulators, with the first example simulating the Brownian motion of particles, and the second example simulating the softening of a concentration gradient.


Comparisons with Wan2.2


VLASim generates a scene abstraction and simulates it with a simulator chosen by the VLM, while Wan2.2 and Veo3 directly generate videos. VLASim produces more physically accurate and temporally coherent results than these state-of-the-art methods. We evaluate Veo3 on only a few examples due to its inference costs; however, these examples clearly show a lack of physical plausibility.


In the Wan2.2 result, the duck moves implausibly to the left, and the ball moves back to the left after the collision.


In the Wan2.2 result, the number of domino blocks changes over time. Additionally, the gap between the blocks does not stop the falling motion from propagating. Finally, the stick on the turntable falls off implausibly at the end.


In the Wan2.2 result, the ball implausibly jumps off the turntable and then jumps back on.


In the Wan2.2 result, an extra pink block appears implausibly.


In the Wan2.2 result, several additional balls appear implausibly.


In the Wan2.2 result, the unstable stack of blocks does not fall even when the blue block pushes on the yellow block.


We also evaluate on Conway's Game of Life, a cellular automaton with simple rules. VLASim correctly simulates the dynamics, while Wan2.2 fails to do so. The caption provided to both methods is "Conway's game of life on a 16 by 9 grid. Each frame constitutes one step of the game. The boundary condition is zero (pixels outside the grid are dead)."


Comparisons with Veo3


In the Veo3 result, the probe changes and a metal end appears. Input caption: "A grabber tool carefully placing a blue wooden block on top of a yellow block which is balanced on a red block forming an L shape. Static shot with no camera movement."


In the Veo3 result, the number of domino blocks changes over time. Additionally, the color and shape of the blocks change over time. Finally, the gap does not stop the falling motion.



Like Wan2.2, Veo3 also fails to simulate Conway's Game of Life correctly.


More Results and Ground Truth Visualisation


We show several results alongside the ground truth videos. Our method produces physically accurate results. Note that there are several valid futures for each scene, so our results do not exactly match the ground truth. We emphasise that our results show the correct physical interactions and dynamics, which is the main goal of our work.