Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
Abstract: Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models.
%
Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content.
%
Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts.
%
In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the \textit{dynamics dimension} to evaluate T2V models.
%
For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video.
%
Based on the new benchmark and the dynamics scores, we assess T2V models using three metrics: \textit{dynamics range}, \textit{dynamics controllability}, and \textit{dynamics-based quality}.
%
Experiments show that DEVIL achieves a Pearson correlation exceeding 90\% with human ratings, demonstrating its potential to advance T2V generation models.
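
As an illustration of how agreement with human ratings can be quantified, the following minimal sketch computes the Pearson correlation coefficient between automatic metric scores and human ratings. It is not the DEVIL implementation: the function name `pearson` and the sample values in `metric_scores` and `human_ratings` are hypothetical and chosen only to make the example runnable.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one automatic score and one human rating per video.
metric_scores = [0.10, 0.35, 0.50, 0.80, 0.95]
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 indicates that the automatic scores rise and fall nearly linearly with the human ratings; a reported correlation above 90\% corresponds to a coefficient above 0.9.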