RigidBench: Evaluating Rigid-Body Physics in Video Generation Models

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop World Models · CC BY 4.0
Keywords: video generation, world models, physics simulation, benchmark, rigid-body, dynamics, video prediction, synthetic data, fine-tuning
TL;DR: A photorealistic physics benchmark with ground-truth trajectories, segmentation masks, and depth maps for evaluating and training video world models.
Abstract: Video generation models are increasingly deployed as world model backbones for physical AI, yet their ability to predict rigid-body dynamics remains unreliable. Existing benchmarks either lack precise ground-truth annotations (relying instead on VLM judgment) or use synthetic primitives that create a domain gap from natural video. We introduce RigidBench, a benchmark that combines Blender physics simulation with photorealistic scenes to provide exact 3D trajectories, segmentation masks, and depth maps across ten physics tasks. Evaluating seven leading models, we find that trajectory accuracy and perceptual quality are poorly correlated: models that best predict object motion often score worst on perceptual metrics. This decoupling demonstrates that standard video quality metrics cannot assess physics understanding, motivating the need for benchmarks with precise physics annotations. We also show that fine-tuning on RigidBench data improves physics prediction, suggesting a path toward more physically grounded world models.
Submission Number: 112