Cycle Consistency Based Method for Learning Disentangled Representation for Stochastic Video Prediction

Published: 01 Jan 2022, Last Modified: 24 Jun 2025, ICIAP (3) 2022, CC BY-SA 4.0
Abstract: Video frame prediction is the computer vision problem of predicting the future frames of a video sequence from a given set of context frames. Video prediction models have found prospective applications in autonomous navigation, representation learning, and healthcare. However, predicting future frames is challenging due to the high-dimensional and stochastic nature of video data. This work proposes a novel cycle consistency loss that disentangles the video representation into low-dimensional, time-dependent pose and time-independent content latent factors in two different VAE-based video prediction models. The key motivation behind the cycle consistency loss is that future frame predictions are more plausible and realistic if the previous frames can be reconstructed from them. The proposed cycle consistency loss is also generic, since it can be applied to other VAE-based stochastic video prediction architectures with slight architectural modifications. We validate our disentanglement hypothesis and the quality of long-range predictions on standard synthetic and challenging real-world datasets, namely Stochastic Moving MNIST and BAIR.
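The abstract does not spell out the exact formulation of the loss, so the following is only a minimal PyTorch-style sketch of one plausible reading of the idea: predict a future frame from a content code and a pose code, then require that the previous frame can be reconstructed from the prediction. The modules `content_enc`, `pose_enc`, `forward_pred`, `backward_pred`, and `decoder` are hypothetical placeholders, not the paper's actual components.

```python
import torch.nn.functional as F

def cycle_consistency_loss(frames, content_enc, pose_enc,
                           forward_pred, backward_pred, decoder):
    """Sketch of a cycle-consistency penalty for a disentangled predictor.

    frames: tensor of shape (T, B, C, H, W), a short clip.
    All modules are hypothetical placeholders:
      content_enc   -> time-independent content code (from the first frame)
      pose_enc      -> time-dependent pose code for a single frame
      forward_pred  -> predicts the next pose from the current pose
      backward_pred -> predicts the previous pose from the current pose
      decoder       -> renders a frame from (content, pose)
    """
    content = content_enc(frames[0])
    loss = 0.0
    for t in range(frames.shape[0] - 1):
        # Forward step: predict frame t+1 from frame t.
        pose_next_hat = forward_pred(pose_enc(frames[t]))
        frame_next_hat = decoder(content, pose_next_hat)
        # Cycle: from the *predicted* frame, go back and reconstruct frame t.
        pose_prev_hat = backward_pred(pose_enc(frame_next_hat))
        frame_prev_hat = decoder(content, pose_prev_hat)
        loss = loss + F.mse_loss(frame_prev_hat, frames[t])
    return loss / (frames.shape[0] - 1)
```

In this reading, the cycle term would be added to the usual VAE objective (reconstruction plus KL), which is consistent with the claim that the loss can be attached to other VAE-based stochastic video prediction architectures with only slight modifications.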