VLASim: World Modelling via VLM-Directed Abstraction and
Simulation
Image caption used as input: "A row of colorful wooden blocks lined
up on a wooden table with wooden stick attached to a black rotating
platform. The platform rotates clockwise and the wooden stick hits
the first block as it rotates. Static shot with no camera movement."
VLASim generates a scene abstraction and simulates it with a
simulator chosen by the VLM, producing a physically accurate and
temporally coherent video. The generated abstract scene
representation is interpretable and controllable. We show several
examples of user interventions, such as changing camera positions
and adding new objects into the scene.
Abstract: Generative video models, a leading approach to
world modeling, face fundamental limitations. They often violate
physical and logical rules, lack interactivity, and operate as
opaque black boxes ill-suited for building structured, queryable
worlds. To overcome these challenges, we propose a new paradigm
focused on distilling an image-caption pair into a tractable,
abstract representation optimized for simulation. We introduce
VLASim, a framework where a Vision-Language Model (VLM) acts as an
intelligent agent to orchestrate this process. The VLM
autonomously constructs a grounded (2D or 3D) scene representation
by selecting from a suite of vision tools, and co-dependently
chooses a compatible physics simulator (e.g., rigid body, fluid)
to act upon it. Furthermore, VLASim can infer latent dynamics from
the static scene to predict plausible future states. Our
experiments show that this combination of intelligent abstraction
and adaptive simulation results in a versatile world model capable
of producing higher-quality simulations across a wider range of
dynamic scenarios than prior approaches.
Fine-Grained Control
In this scene, we show fine-grained control over a robot arm
moving blocks. The abstraction is produced by VLASim, and the user
can direct the robot arm to desired positions. We show two
different scenes from the Language Table Dataset, with two control
sequences per scene.
Further Comparisons with Veo3
In the Veo results, objects move in implausible ways. The double
pendulum disconnects, and a third bob appears and disappears. In
the whiteboard scene, the core mechanic is not faithfully
reproduced, and two additional blocks appear spontaneously. In the
cluttered scene, the ball bounces implausibly, the tennis ball
tube and pink box move unrealistically, and the roll of tape jumps
onto the top of the box. By contrast, VLASim produces physically
accurate and temporally coherent results.
Intervention Experiments
Interventions on the liquid-on-duck scene. First, we swap the duck
mesh produced by VLASim with an external asset: the Stanford
Bunny. Second, we reduce the flow rate of the liquid.
Interventions on the ball-hits-duck scene. First, we reduce the
mass of the duck. Second, we change the direction of gravity so
that the objects fall upwards.
Interventions on Conway's Game of Life. First, we invert the
appearance of the game so that dead cells are illustrated with
flowers. Second, we change the rules of the game so that a cell
survives if it has 1, 2 or 3 neighbours. Note that this second
intervention takes the simulation outside the distribution of the
LLM's training data, showing that interventions can still achieve
a desired outcome that lies out of distribution for an LLM.
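As an illustration of this kind of rule change, a Game-of-Life step
can be parameterised by its birth and survival neighbour counts. The
following is a minimal sketch with names and defaults of our own
choosing, not VLASim's actual code; standard Life is born={3},
survive={2, 3}, and the modified intervention uses survive={1, 2, 3}.

```python
def life_step(grid, survive=frozenset({2, 3}), born=frozenset({3})):
    """One Game-of-Life step on a 0/1 grid with zero (dead) boundaries.

    `survive` and `born` are the neighbour counts at which a live cell
    persists and a dead cell becomes alive; standard Life is B3/S23.
    """
    h, w = len(grid), len(grid[0])

    def neighbours(r, c):
        # Count live cells in the 8-neighbourhood; cells outside the
        # grid are treated as dead (zero boundary condition).
        return sum(grid[rr][cc]
                   for rr in range(r - 1, r + 2)
                   for cc in range(c - 1, c + 2)
                   if (rr, cc) != (r, c) and 0 <= rr < h and 0 <= cc < w)

    return [[1 if neighbours(r, c) in (survive if grid[r][c] else born)
             else 0
             for c in range(w)]
            for r in range(h)]
```

Under the modified rule survive={1, 2, 3}, an isolated pair of
adjacent live cells persists indefinitely, whereas under the standard
rules it dies out in a single step.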
Interventions on the reaction diffusion scene. In both
interventions we change the feed and kill rates of the second
chemical, resulting in different patterns forming over time.
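For context, a standard formulation of such a system is the
Gray-Scott model, in which a second chemical V consumes U, U is
replenished at a feed rate F, and V is removed at a kill rate k;
F and k together determine which patterns emerge. The following is a
minimal single-step sketch with parameter values of our own choosing,
not the simulator VLASim selects.

```python
def gray_scott_step(U, V, Du=0.16, Dv=0.08, F=0.035, k=0.065, dt=1.0):
    """One explicit-Euler step of the Gray-Scott reaction-diffusion
    model. F is the feed rate of U and k the kill rate of V.
    Periodic boundaries via index wrap-around.
    """
    h, w = len(U), len(U[0])

    def lap(A, r, c):  # 5-point discrete Laplacian
        return (A[(r - 1) % h][c] + A[(r + 1) % h][c]
                + A[r][(c - 1) % w] + A[r][(c + 1) % w] - 4 * A[r][c])

    U2 = [row[:] for row in U]
    V2 = [row[:] for row in V]
    for r in range(h):
        for c in range(w):
            uvv = U[r][c] * V[r][c] ** 2  # reaction term: U + 2V -> 3V
            U2[r][c] = U[r][c] + dt * (Du * lap(U, r, c) - uvv
                                       + F * (1 - U[r][c]))
            V2[r][c] = V[r][c] + dt * (Dv * lap(V, r, c) + uvv
                                       - (F + k) * V[r][c])
    return U2, V2
```

Sweeping F and k over small ranges yields qualitatively different
patterns (spots, stripes, waves), which is the behaviour these
interventions exploit.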
Interventions on the block domino scene. First, we add another
block to the row. Second, we interrupt the cascade by firing
particles at the block row.
Variability in Simulations
Generating a simulation on the basis of a single image and caption
is inherently under-constrained. In our method, the use of an LLM
produces naturally diverse simulation outcomes, and the
variability matches the uncertainty in the input. Here, we show
multiple different simulations generated from the same input image
and caption. In the first example, the scene is relatively
constrained, so the different simulations are similar to each
other, differing only in the final position of the blocks. In the
second example, the scene is more ambiguous, and the resulting
simulations differ due to differences in ball speed, restitution,
and surface friction.
Further Results
Additional results showing scenes with physical interaction. In the
first scene we drop a tennis ball into a cluttered arrangement of
objects. In the second, the caption instructs the mouse to be
moved to wake the computer. Importantly, VLASim segments and modifies only the
components relevant to this action, leaving the rest of the scene
unchanged. In the third example, we demonstrate throwing a ball into
the scene with various cluttered objects.
Additional results showing scenes containing physical abstraction.
In the first example, from the Aerial Traffic Dataset, the motion of
a bus is reduced to a 2D simulation, prompted with the caption "a
bus turning right at an intersection". In the second example, a game
of Tetris is abstracted from a simple drawing of the game. In the
final example, a modified sample from the PhysGen Dataset, a complex
scene of a pendulum freely swinging while attached to an
accelerating car is correctly modelled. Note that as the car
accelerates, the pendulum moves backwards in the car's frame of
reference.
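That last behaviour follows from a pseudo-force in the car's
non-inertial frame: with car acceleration a, the pendulum's
equilibrium shifts to theta = -arctan(a/g), i.e. it hangs backwards.
A minimal damped-pendulum sketch of this (illustrative code with
assumed parameters, not the simulator VLASim selects):

```python
import math

def settle_angle(a, theta0=0.0, g=9.81, L=1.0,
                 damping=2.0, dt=1e-3, steps=5000):
    """Integrate a damped pendulum in the frame of a car with constant
    acceleration `a`. The pseudo-force contributes the -(a/L)*cos
    term; returns the angle (rad from vertical) it settles towards.
    """
    theta, omega = theta0, 0.0
    for _ in range(steps):
        alpha = (-(g * math.sin(theta) + a * math.cos(theta)) / L
                 - damping * omega)
        omega += alpha * dt  # semi-implicit Euler update
        theta += omega * dt
    return theta
```

For a = g the pendulum settles near -45 degrees, consistent with the
backwards swing seen in the car's frame of reference.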
Coarse Control
In these examples we demonstrate coarse control. By only changing
the caption passed to the model, we can modify the physical dynamics
of the scene. In the first example, we provide a caption indicating
that a second tennis ball falls from above the stack, resulting in a
collision. In the second example, we provide a caption indicating
that the camera pans to an overhead view of the stack.
Using Visual Context
Both these simulations were given the same caption: "A diffusion
process", but different initial images. The model uses the visual
context to select appropriate simulators, with the first example
simulating the Brownian motion of particles, and the second example
simulating the softening of a concentration gradient.
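The two interpretations correspond to quite different simulators. A
minimal sketch of each (our own illustrative code, not VLASim's tool
suite): a Brownian step for particles, and an explicit
heat-equation step for a 1-D concentration profile.

```python
import random

def brownian_step(positions, sigma=0.1):
    """Each particle takes an independent Gaussian step."""
    return [x + random.gauss(0.0, sigma) for x in positions]

def diffuse_1d(u, D=0.1, dt=1.0, dx=1.0):
    """Explicit finite-difference step of the heat equation
    u_t = D u_xx, with the two boundary values held fixed."""
    r = D * dt / dx ** 2  # must satisfy r <= 0.5 for stability
    return ([u[0]]
            + [u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
               for i in range(1, len(u) - 1)]
            + [u[-1]])
```

Repeated application of diffuse_1d softens a step-shaped
concentration profile, while brownian_step spreads a cloud of
particles, matching the two behaviours described above.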
Comparisons with Wan2.2
VLASim generates a scene abstraction and simulates it with a
simulator chosen by the VLM, while Wan2.2 and Veo3 directly generate
videos. VLASim produces more physically accurate and temporally
coherent results than these state-of-the-art methods. We evaluate
Veo3 on only a few examples due to its inference cost; even these
examples clearly show a lack of physical plausibility.
In the Wan2.2 result, the duck moves implausibly to the left, and
the ball implausibly moves back to the left after the collision.
In the Wan2.2 result, the number of domino blocks changes over time.
Additionally, the gap between the blocks does not stop the falling
motion from propagating. Finally, the stick on the turntable falls
off implausibly at the end.
In the Wan2.2 result, the ball implausibly jumps off the turntable
and then jumps back on.
In the Wan2.2 result, an extra pink block appears implausibly.
In the Wan2.2 result, several additional balls appear implausibly.
In the Wan2.2 result, the unstable stack of blocks does not fall even
when the blue block pushes on the yellow block.
We also evaluate on Conway's Game of Life, a cellular automaton with
simple rules. VLASim correctly simulates the dynamics, while Wan2.2
fails to do so. The caption provided to both methods is "Conway's
game of life on a 16 by 9 grid. Each frame constitutes one step of
the game. The boundary condition is zero (pixels outside the grid
are dead)."
Comparisons with Veo3
In the Veo3 result, the grabber tool changes shape and a metal end appears.
Input caption: "A grabber tool carefully placing a blue wooden
block on top of a yellow block which is balanced on a red block
forming an L shape. Static shot with no camera movement."
In the Veo3 result, the number of domino blocks changes over time.
Additionally, the color and shape of blocks change over time.
Finally, the gap does not stop the falling motion.
Like Wan2.2, Veo3 also fails to simulate Conway's Game of Life
correctly.
More Results and Ground Truth Visualisation
We show several results, along with the ground truth videos. Our
method produces physically accurate results. Note that there are
several valid futures for each scene, so our results do not
exactly match the ground truth. We want to emphasise that our
results show the correct physical interactions and dynamics, which
is the main goal of our work.