Keywords: Video generation, Diffusion model
Abstract: Current video diffusion models generate visually compelling content but often violate
basic laws of physics, producing subtle artifacts like rubber-sheet deformations and
inconsistent object motion. We introduce a frequency-domain physics prior that improves
motion plausibility without modifying model architectures. Our method decomposes common
rigid motions (translation, rotation, scaling) into lightweight spectral losses,
requiring only 2.7% of frequency coefficients while preserving 97%+ of spectral energy.
Applied to Open-Sora, MVDIT, and Hunyuan, our approach improves both motion accuracy and action recognition on OpenVID-1M by ~11% on average (relative) while maintaining visual quality; it also reduces warping error by 22--37% (depending on the backbone) and improves temporal consistency scores. User studies show a 74--83% preference for our physics-enhanced videos. These results indicate that simple, global spectral cues are an effective drop-in regularizer for physically plausible motion in video diffusion.
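The truncated spectral loss described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the function name, the per-frame difference signal, and the magnitude-based coefficient selection are all assumptions; only the idea of comparing a small fraction (~2.7%) of frequency coefficients comes from the abstract.

```python
# Hypothetical sketch of a frequency-domain motion loss: compare truncated
# 2D FFTs of consecutive-frame differences, keeping only a small fraction
# of coefficients (the abstract reports ~2.7%). Illustrative only.
import numpy as np

def spectral_motion_loss(pred_frames, ref_frames, keep_frac=0.027):
    """L1 distance between sparsified spectra of frame-to-frame motion."""
    def truncated_spectrum(frames):
        # Motion signal: difference between consecutive frames, shape (T-1, H, W)
        diff = np.diff(frames, axis=0)
        spec = np.fft.fft2(diff)          # per-frame 2D FFT
        mags = np.abs(spec)
        # Keep only the largest-magnitude coefficients in each frame
        k = max(1, int(keep_frac * spec[0].size))
        flat = mags.reshape(mags.shape[0], -1)
        thresh = np.sort(flat, axis=1)[:, -k][:, None, None]
        return np.where(mags >= thresh, spec, 0.0)

    sp = truncated_spectrum(np.asarray(pred_frames, dtype=float))
    sr = truncated_spectrum(np.asarray(ref_frames, dtype=float))
    return float(np.mean(np.abs(sp - sr)))

# Identical clips incur zero loss; perturbed motion does not.
rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 16, 16))
print(spectral_motion_loss(clip, clip))            # 0.0
print(spectral_motion_loss(clip, clip[::-1]) > 0)  # True
```

Keeping only the top-magnitude coefficients is one plausible way to reach the quoted 2.7% budget; a fixed low-frequency mask would be another.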
Supplementary Material: zip
Primary Area: generative models
Submission Number: 4165