Unsupervised Hierarchical Video Prediction


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission
  • Abstract: Long term video prediction is a challenging machine learning problem, whose solution would enable intelligent agents to plan sophisticated interactions with their environment and assess the effect of their actions. Much recent research has been devoted to video prediction and generation, but mostly for short-scale time horizons. The hierarchical video prediction method by Villegas et al. (2017) is an example of a state-of-the-art method for long term video prediction. However, their method has limited applicability in practical settings, as it requires annotation of structures (e.g., pose) at every time step. This paper presents a long term hierarchical video prediction model that does not have such a restriction. We show that the network learns its own higher-level structure (e.g., pose-equivalent hidden variables), which works better in cases where a predefined structure does not capture all of the information needed to predict the next frames. This method gives sharper results than other video prediction methods that do not require a ground-truth pose, and its efficiency is shown on the Humans 3.6M and Robot Pushing datasets.
  • TL;DR: We show ways to train a hierarchical video prediction model without needing pose labels.
  • Keywords: video prediction, visual analogy network, unsupervised
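The pipeline the abstract describes (encode each frame into learned pose-like latents, predict those latents forward in time, then render future frames by analogy with an observed frame) can be illustrated with a toy sketch. All names, shapes, and the linear maps below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

# Toy sketch of hierarchical video prediction: an encoder maps each frame to
# low-dimensional "pose equivalent" latents, a predictor rolls the latents
# forward in time, and a decoder renders future frames by analogy with the
# first observed frame. All components are hypothetical linear stand-ins.

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM = 64, 8

W_enc = rng.standard_normal((LATENT_DIM, FRAME_DIM)) * 0.1   # frame -> latent
W_pred = rng.standard_normal((LATENT_DIM, LATENT_DIM)) * 0.1 # latent dynamics
W_dec = rng.standard_normal((FRAME_DIM, LATENT_DIM)) * 0.1   # latent -> frame

def encode(frame):
    """Map a frame to its learned higher-level structure (latent)."""
    return np.tanh(W_enc @ frame)

def predict_latent(latent):
    """Predict the next latent purely in the low-dimensional space."""
    return np.tanh(W_pred @ latent)

def decode(latent, ref_latent, ref_frame):
    """Analogy-style generation: apply the change implied by the latent
    difference to a reference frame (cf. visual analogy networks)."""
    return ref_frame + W_dec @ (latent - ref_latent)

def rollout(first_frame, horizon):
    """Predict `horizon` future frames from a single observed frame."""
    z0 = encode(first_frame)
    z, frames = z0, []
    for _ in range(horizon):
        z = predict_latent(z)
        frames.append(decode(z, z0, first_frame))
    return frames

frames = rollout(rng.standard_normal(FRAME_DIM), horizon=5)
print(len(frames), frames[0].shape)
```

The key point mirrored here is that the long-horizon dynamics run entirely in the latent space, so no per-frame pose annotation is needed; the latents play the role of the pose in Villegas et al. (2017) but are learned without supervision.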