TL;DR: We propose a novel approach, based on auto-encoding point tracks, for evaluating motion in generated videos.
Abstract: A current limitation of generative video models is that they produce plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular metrics for evaluating generated videos. Here we go beyond FVD by developing a metric that better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used not only to compare distributions of videos (as few as one generated and one ground-truth video, or as many as two datasets), but also to evaluate the motion of single videos. We show that using point tracks instead of pixel-reconstruction or action-recognition features results in a metric that is markedly more sensitive to temporal distortions in synthetic data, and that predicts human evaluations of temporal consistency and realism in videos from open-source generative models better than a wide range of alternatives. We also show that the point-track representation lets us spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and a link to the code can be found on the project page: trajan-paper.github.io.
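The paper's actual model is documented on the project page; purely as a minimal sketch of the recipe the abstract outlines, the code below auto-encodes pre-extracted point tracks, uses per-track reconstruction error as a single-video motion score, and compares per-video latent features between two sets of videos with a Fréchet distance (the same comparison FVD applies to its action-recognition features). The names (`TrackAutoencoder`, `video_motion_score`, `video_motion_feature`) and all shapes are assumptions for illustration, not the paper's implementation; in practice the tracks would come from an off-the-shelf point tracker and the autoencoder would be trained on tracks from real videos.

```python
# A minimal sketch (not the paper's implementation), assuming point tracks
# have already been extracted for each video. All names, shapes, and
# architecture choices here are hypothetical.
import numpy as np
import torch
import torch.nn as nn
from scipy import linalg


class TrackAutoencoder(nn.Module):
    """Toy autoencoder over point tracks of shape [num_points, T, 2]."""

    def __init__(self, num_frames: int, latent_dim: int = 16):
        super().__init__()
        in_dim = num_frames * 2  # each track is a flattened (x, y) trajectory
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, tracks: torch.Tensor):
        flat = tracks.flatten(start_dim=1)        # [num_points, T*2]
        z = self.encoder(flat)                    # [num_points, latent_dim]
        recon = self.decoder(z).view_as(tracks)   # [num_points, T, 2]
        return z, recon


def video_motion_score(model: TrackAutoencoder, tracks: torch.Tensor) -> float:
    """Single-video score: mean track reconstruction error (lower = motion
    more typical of whatever the autoencoder was trained on)."""
    with torch.no_grad():
        _, recon = model(tracks)
    return torch.mean((recon - tracks) ** 2).item()


def video_motion_feature(model: TrackAutoencoder, tracks: torch.Tensor) -> np.ndarray:
    """Single-video feature: latents averaged over all tracks."""
    with torch.no_grad():
        z, _ = model(tracks)
    return z.mean(dim=0).numpy()


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two feature sets, as in FID/FVD."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))


if __name__ == "__main__":
    # Random stand-ins for tracks; a real pipeline would extract tracks with a
    # point tracker and train the autoencoder on tracks from real videos.
    T = 16
    model = TrackAutoencoder(num_frames=T)
    real = [torch.randn(128, T, 2) for _ in range(64)]  # 64 "real" videos
    gen = [torch.randn(128, T, 2) for _ in range(64)]   # 64 "generated" videos
    print("single-video score:", video_motion_score(model, gen[0]))
    real_feats = np.stack([video_motion_feature(model, t) for t in real])
    gen_feats = np.stack([video_motion_feature(model, t) for t in gen])
    print("distributional distance:", frechet_distance(real_feats, gen_feats))
```

In this sketch, the distributional distance plays the role FVD plays over its features, while the per-video reconstruction score is what makes single-video evaluation possible.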
Lay Summary: Current artificial intelligence models that create videos often make believable-looking individual frames, but the way things move in the videos isn't very realistic. Existing ways of checking video quality don't do a good job of spotting these poor movements, and usually require access to a whole set of videos rather than being applicable to just one.
We created a new way to measure video quality that focuses specifically on how well objects move and interact. Our method works by tracking points on objects throughout the video and using this information to understand the motion. This allows us to see how realistic the movement is, even for individual videos.
We found that our new approach, which uses these tracked points, is much better than other methods at detecting strange or unnatural movements in computer-generated videos. It also does a better job of matching what people judge to be realistic and consistent in videos made by AI. Additionally, our method can pinpoint exactly where in a video the movement looks wrong, making it easier to understand the errors being made.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: trajan-paper.github.io
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: metrics, videos, motion, point tracking
Submission Number: 12700