Track: long paper (up to 8 pages)
Keywords: video generation, physics simulation, benchmark, rigid-body dynamics, world models, evaluation protocol, diffusion transformers
TL;DR: We introduce RigidBench, a photorealistic physics benchmark with exact ground-truth trajectories that reveals trajectory accuracy and perceptual quality are uncorrelated across seven video generation models.
Abstract: Video generation models are increasingly deployed as world model backbones for physical AI, yet their ability to predict rigid-body dynamics remains unreliable. Existing benchmarks either lack precise ground-truth annotations (relying on VLM judgment) or render synthetic primitives against plain backgrounds, introducing a visual domain gap from natural video. We introduce RigidBench, a benchmark combining Blender physics simulation with photorealistic interior scenes to provide exact 3D trajectories, segmentation masks, and depth maps across ten rigid-body physics tasks. Our evaluation protocol spans object localization, trajectory tracking, depth consistency, and perceptual quality, enabling controlled comparison across models. Evaluating seven models spanning open-source diffusion transformers and closed-source commercial systems, we find that trajectory accuracy and perceptual quality are essentially uncorrelated (r=0.002): models that best predict object motion often score worst on perceptual metrics. This demonstrates that standard video quality metrics cannot assess physical understanding, motivating evaluation with precise physics annotations. We further show that fine-tuning on RigidBench data improves physics prediction on held-out tasks, suggesting a path toward more physically grounded video generation.
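Illustrative note: the headline finding is a near-zero model-level correlation between trajectory accuracy and perceptual quality. Below is a minimal Python sketch of how such a correlation could be computed across evaluated models, using scipy.stats.pearsonr; the score lists are hypothetical placeholders for illustration only, not RigidBench results.

    # Minimal sketch: model-level correlation between two aggregate metrics.
    # The numbers below are hypothetical placeholders, NOT RigidBench scores.
    from scipy.stats import pearsonr

    # One aggregate score per evaluated model (seven models in the paper).
    trajectory_accuracy = [0.61, 0.48, 0.72, 0.55, 0.39, 0.66, 0.50]
    perceptual_quality  = [0.70, 0.82, 0.58, 0.77, 0.88, 0.64, 0.80]

    # Pearson correlation across models; a value near zero would indicate
    # the two axes of evaluation are decoupled, as the abstract reports.
    r, p_value = pearsonr(trajectory_accuracy, perceptual_quality)
    print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")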
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 76