Submission Type: Short Papers (up to 4 pages)
Keywords: Video Generation, Evaluation Metrics, Human Action Consistency, Temporal Consistency
TL;DR: We introduce a human-motion-grounded evaluation metric and benchmark that outperforms existing methods by over 68% in alignment with human judgment on action correctness and temporal plausibility in generated videos.
Abstract: Despite rapid advances in video generative models, robust metrics for evaluating the visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased and lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Through extensive experiments, we show that our metric achieves a substantial improvement of more than 68% over existing state-of-the-art methods on our benchmark.
Submission Number: 12