Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
Keywords: Video Generation, Evaluation Metrics, Human Action Consistency, Temporal Consistency
TL;DR: We introduce a human-motion-grounded evaluation metric and benchmark; our metric outperforms existing methods by over 68% in alignment with human judgments of action correctness and temporal plausibility in generated videos.
Abstract: Despite rapid advances in video generative models, robust metrics for evaluating the visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased and lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Through extensive experiments, we show that our metric achieves a substantial improvement of more than 68% over existing state-of-the-art methods on our benchmark.
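The abstract describes the metric only at a conceptual level: score a generated video by how its motion embedding relates to a learned latent space of real human actions. A minimal sketch of that idea, assuming a hypothetical clip encoder and a precomputed bank of real-action embeddings (the names `ActionEncoder`-style `encoder`, `real_bank`, and the nearest-neighbor scoring rule are illustrative assumptions, not the authors' actual method):

```python
import torch
import torch.nn.functional as F

def action_plausibility_score(video: torch.Tensor,
                              encoder,
                              real_bank: torch.Tensor) -> float:
    """Hypothetical latent-space action score.

    video:     (T, C, H, W) frames of one generated clip.
    encoder:   callable mapping a (1, T, C, H, W) clip to a (1, D) latent.
    real_bank: (N, D) embeddings of real-world action clips.
    """
    z = encoder(video.unsqueeze(0))          # (1, D) latent for the clip
    z = F.normalize(z, dim=-1)               # compare on the unit sphere
    bank = F.normalize(real_bank, dim=-1)
    sims = z @ bank.T                        # cosine similarity to real actions
    # Average similarity to the nearest real actions: higher means the
    # generated motion lies closer to the manifold of real human actions.
    k = min(5, bank.shape[0])
    return sims.topk(k=k, dim=-1).values.mean().item()
```

A distance- or density-based rule over the same latent bank would work equally well here; the sketch only illustrates grounding the score in real-action embeddings rather than in appearance features.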
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 4