TL;DR: This paper explores physics post-training for video diffusion models and releases a novel benchmark for evaluating video diffusion models' physical understanding.
Abstract: Large-scale pre-trained video generation models excel at content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of a simple, yet fundamental, physics task: modeling objects in freefall. We show that state-of-the-art video generation models struggle with this basic task despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small set of simulated videos is effective at inducing the dropping behavior in the model, and results can be further improved through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training with respect to generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development. Code is available at this repository: https://github.com/vision-x-nyu/pisa-experiments.
Lay Summary: Artificial intelligence can now create realistic-looking videos, but it still doesn't fully understand how the real world works. For example, if you ask a computer to generate a video of an object falling, the result might look convincing, but the object might not fall the way it would in real life.
In this work, we focus on teaching computers to better understand simple physical rules like gravity. We do this by giving them extra practice: first, we show them some videos that follow real physics, and then we gently correct their mistakes using a special scoring system that rewards more realistic behavior.
Even with this training, we find that these programs can still make errors in new situations. To help researchers check how well these systems understand physics, we’ve also built a testing tool and made everything publicly available online.
Link To Code: https://github.com/vision-x-nyu/pisa-experiments
Primary Area: Applications->Computer Vision
Keywords: video generation models, intuitive physics, world model
Submission Number: 7650