Abstract: While recent generative video models have achieved remarkable visual realism and are being
explored as world models, true physical simulation requires mastering both space and time. Current
models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to
ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the
common practice of training indiscriminately on videos whose real-world speeds vary widely yet are
all forced into standardized frame rates. This leads to what we term chronometric hallucination: generated
sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this,
we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS)
directly from the visual dynamics of an input video. Trained via controlled temporal resampling,
our method estimates the true temporal scale implied by the motion itself, bypassing unreliable
metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real
and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer
from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying
PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos.
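To make the training signal concrete, the following is a minimal Python sketch of one way the controlled temporal resampling could be set up: resampling a video of trusted frame rate by a known speed factor yields a clip whose visual dynamics imply a known PhyFPS label. The function name make_resampled_clip, the clip length, and the exact labeling convention are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def make_resampled_clip(frames: np.ndarray, src_fps: float,
                        speed_factor: float, clip_len: int = 16):
    """Build one (clip, PhyFPS label) training pair via temporal resampling.

    frames: (T, H, W, C) array from a source video with trusted src_fps.
    speed_factor > 1 subsamples frames so depicted motion appears faster;
    speed_factor < 1 repeats frames so it appears slower.
    """
    # Index the source at a stride of speed_factor (nearest-frame rounding).
    idx = np.clip(np.round(np.arange(clip_len) * speed_factor).astype(int),
                  0, len(frames) - 1)
    clip = frames[idx]
    # Consecutive clip frames are speed_factor / src_fps seconds apart in
    # physical time, so the temporal scale implied by the motion itself
    # (the PhyFPS label, under this assumed convention) is:
    phyfps_label = src_fps / speed_factor
    return clip, phyfps_label
```

Sampling speed_factor over a wide range during training would expose the predictor to many physical speeds for the same content, which is what lets it estimate PhyFPS from visual dynamics alone rather than from container metadata.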