Submission Type: Short Papers (up to 4 pages)
Keywords: Video Generation, Evaluation Metrics, Human Action Consistency, Temporal Consistency
TL;DR: We introduce a human-motion-grounded evaluation metric and benchmark that outperforms existing methods by over 68% in alignment with human judgment on action correctness and temporal plausibility in generated videos.
Abstract: Despite rapid advances in video generative models, robust metrics for evaluating the visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased and lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Through extensive experiments, we show that our metric achieves a substantial improvement of more than 68% over existing state-of-the-art methods on our benchmark.
Submission Number: 12