Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Published: 25 Sept 2024, Last Modified: 27 Sept 2024, OpenReview Archive Direct Upload, Everyone, CC BY 4.0
Abstract: Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness of video content and its faithfulness to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models.
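As a reading aid for the correlation claim in the abstract, the following minimal sketch shows how a Pearson correlation between automatic metric scores and human ratings can be computed. It is not the authors' evaluation code: the function name `pearson` and the two score lists are hypothetical, and the numbers are made-up placeholders rather than results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values (NOT data from the paper):
# one automatic score and one averaged human rating per T2V model.
metric_scores = [0.42, 0.55, 0.61, 0.70, 0.83]
human_ratings = [2.1, 2.9, 3.0, 3.8, 4.4]

print(f"Pearson r = {pearson(metric_scores, human_ratings):.3f}")
```

Under this reading, a correlation "exceeding 90%" corresponds to a coefficient above 0.90 on such paired scores.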