TL;DR: A 2-step I2V sample can preserve physics better than a 50-step output from the same model. PhaseLock locks this early motion prior into the final generation, improving physical consistency by an average of 6.2 points at 1.06x time and 1.02x memory.
Abstract: Image-to-video (I2V) diffusion models leverage input images to generate visually stunning content, yet they frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising: the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion prior from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it on the high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity. It adds negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduces reliance on expensive external guidance methods, which cost $\sim5\times$ time.
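To make the phase-versus-magnitude distinction concrete, the following is a minimal sketch of the general Fourier property behind it. It is not the authors' spectral analysis or their metric. It uses a toy 1-D signal and a naive DFT, and the helper names (`dft`, `magnitude_and_phase`, `circular_shift`, `phase_similarity`) are hypothetical, introduced only for illustration. The sketch shows that moving a pattern leaves the spectral magnitude unchanged while the phase changes, which is why position and motion information is commonly associated with phase.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1-D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Split the spectrum into per-frequency magnitude and phase (radians)."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def circular_shift(signal, shift):
    """Move the pattern `shift` samples to the right, wrapping around."""
    n = len(signal)
    return [signal[(t - shift) % n] for t in range(n)]


def phase_similarity(phase_a, phase_b, magnitudes, eps=1e-9):
    """Mean cosine of the phase difference over bins that carry energy.

    Illustrative score only (an assumption, not the paper's measure):
    1.0 means identical phase; lower values mean the phase has drifted.
    """
    terms = [
        math.cos(a - b)
        for a, b, m in zip(phase_a, phase_b, magnitudes)
        if m > eps
    ]
    return sum(terms) / len(terms)


# Toy "object": a small bump, then the same bump moved by two samples.
signal = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
moved = circular_shift(signal, 2)

mag_a, ph_a = magnitude_and_phase(signal)
mag_b, ph_b = magnitude_and_phase(moved)

# Largest per-frequency magnitude difference: numerically zero under a shift.
magnitude_gap = max(abs(a - b) for a, b in zip(mag_a, mag_b))

# Phase agreement between the original and the moved pattern: far below 1.
similarity = phase_similarity(ph_a, ph_b, mag_a)

print(f"max magnitude difference: {magnitude_gap:.2e}")
print(f"phase similarity: {similarity:.3f}")
```

In this toy setting the magnitude alone cannot tell where the bump is, whereas the phase can. A phase-agreement score of this kind is one simple way to quantify how much positional structure two signals share.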
Lay Summary: AI video models can now create visually impressive videos, but they often make physically impossible mistakes, such as objects moving in the wrong direction, disappearing, or ignoring gravity. We found a surprising reason for this problem: in image-to-video diffusion models, a very early, blurry video generated in just a few steps can preserve the correct motion better than the final, polished video. In other words, the model often finds a reasonable physical motion early, but later visual refinement can accidentally overwrite it.
We propose PhaseLock, a training-free method that keeps this early motion information and uses it to guide the final high-quality generation. Instead of adding new physics simulators or retraining the model, PhaseLock reuses what the model already knows at the beginning of generation. Across several video models, this improves physical consistency with very little extra computation. This suggests that better physical video generation may not always require larger models or more inference steps, but rather better preservation of the motion information that models already capture.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/dnwjddl/phaselock/tree/main/code
Primary Area: Applications->Computer Vision
Keywords: Video Generation; Diffusion Transformer
Originally Submitted PDF: pdf
Submission Number: 12936