Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
Keywords: Video Generation, Evaluation Metrics, Human Action Consistency, Temporal Consistency
TL;DR: We introduce a human-motion-grounded evaluation metric and benchmark; our metric outperforms existing methods by over 68% in alignment with human judgments of action correctness and temporal plausibility in generated videos.
Abstract: Despite rapid advances in video generative models, robust metrics for evaluating the visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased and lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Through extensive experiments, we show that our metric achieves a substantial improvement of more than 68% over existing state-of-the-art methods on our benchmark.
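The abstract describes the metric only at a conceptual level: score a generated video by how its motion embedding relates to a learned latent space of real human actions. A minimal sketch of that idea, assuming a hypothetical clip encoder and a precomputed bank of real-action embeddings (the names `ActionEncoder`-style `encoder`, `real_bank`, and the nearest-neighbor scoring rule are illustrative assumptions, not the authors' actual method):

```python
import torch
import torch.nn.functional as F

def action_plausibility_score(video: torch.Tensor,
                              encoder,
                              real_bank: torch.Tensor) -> float:
    """Hypothetical latent-space action score.

    video:     (T, C, H, W) frames of one generated clip.
    encoder:   callable mapping a (1, T, C, H, W) clip to a (1, D) latent.
    real_bank: (N, D) embeddings of real-world action clips.
    """
    z = encoder(video.unsqueeze(0))          # (1, D) latent for the clip
    z = F.normalize(z, dim=-1)               # compare on the unit sphere
    bank = F.normalize(real_bank, dim=-1)
    sims = z @ bank.T                        # cosine similarity to real actions
    # Average similarity to the nearest real actions: higher means the
    # generated motion lies closer to the manifold of real human actions.
    k = min(5, bank.shape[0])
    return sims.topk(k=k, dim=-1).values.mean().item()
```

A distance- or density-based rule over the same latent bank would work equally well here; the sketch only illustrates grounding the score in real-action embeddings rather than in appearance features.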
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 4