Abstract: Scaling video generation models is believed to be a promising path toward building world models that adhere to fundamental physical laws. However, it remains an open question whether these models can discover physical laws purely from visual observation.
A world model that has learned the true law should give predictions that are robust to nuances and extrapolate correctly to unseen scenarios.
In this work, we evaluate generalization across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization.
We develop a 2D simulation testbed for object movement and collisions that generates videos deterministically governed by one or more classical mechanics laws.
We focus on the scaling behavior of diffusion-based video generation models trained to predict object movements from initial frames.
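To make the data-generation setup concrete, below is a minimal sketch of a deterministic 2D simulator that renders two balls under uniform motion and a single elastic collision into video frames. The function names (`render_frame`, `simulate_collision`), frame resolution, and all parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Minimal sketch (not the authors' code): a deterministic 2D simulation of two
# balls under uniform motion with one elastic collision, rendered to frames.
# All names, shapes, and parameters below are illustrative assumptions.

def render_frame(positions, radii, size=32):
    """Rasterize circles onto a square grayscale frame."""
    frame = np.zeros((size, size), dtype=np.float32)
    ys, xs = np.mgrid[0:size, 0:size]
    for (cx, cy), r in zip(positions, radii):
        mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
        frame[mask] = 1.0
    return frame

def simulate_collision(x1, x2, v1, v2, m1, m2, r1, r2, n_frames=16, dt=1.0):
    """Two balls moving along the x-axis; resolve one elastic collision."""
    frames = []
    y = 16.0  # both balls travel along the same horizontal line
    for _ in range(n_frames):
        frames.append(render_frame([(x1, y), (x2, y)], [r1, r2]))
        x1 += v1 * dt
        x2 += v2 * dt
        if abs(x1 - x2) <= r1 + r2 and v1 > v2:  # overlapping while approaching
            # 1D elastic collision: conserve momentum and kinetic energy
            v1_new = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
            v2_new = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
            v1, v2 = v1_new, v2_new
    return np.stack(frames)  # shape: (n_frames, 32, 32)

# Example: a heavier ball approaching from the right collides with a lighter one.
video = simulate_collision(x1=4, x2=24, v1=2.0, v2=-1.0, m1=1.0, m2=2.0, r1=2, r2=3)
```

In this sketch, the first few frames would serve as the conditioning input and the remaining frames as the prediction target for the video model.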
Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios.
Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color $>$ size $>$ velocity $>$ shape.
Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws.
Lay Summary: Current video generation models are powerful and able to generate high-fidelity videos, which might help build AI systems that can simulate the future of the real world. However, it is unclear whether current AI models, trained solely by watching videos, truly learn the fundamental rules of physics.
In this paper, we explore whether modern AI video models can discover and generalize basic physics principles simply by watching videos. We test the AI's ability to predict object movements in three scenarios: familiar cases (similar to the training data), completely unfamiliar situations, and new combinations of known elements.
Using simplified computer-generated videos of moving and colliding objects, we systematically trained and evaluated video prediction models. Our findings show these models can handle familiar situations perfectly, and they perform reasonably well when combining known factors in new ways. However, they struggle significantly when faced with completely new situations they haven't encountered before.
Further investigation revealed two important insights. First, these AI models do not truly learn general physics principles but instead rely heavily on remembering specific examples they've seen before. Second, when faced with new situations, the models prioritize features in a particular order: first color, then size, then speed, and finally shape.
Overall, our research highlights that simply training larger AI models with more videos isn't enough. To create AI systems that genuinely understand physics, new approaches are needed beyond just scaling up existing methods.
Link To Code: https://phyworld.github.io/
Primary Area: Deep Learning->Foundation Models
Keywords: video generation, diffusion model, world model
Submission Number: 10756