Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Wonkwang Lee; Whie Jung; Han Zhang; Ting Chen; Jing Yu Koh; Thomas Huang; Hyungsuk Yoon; Honglak Lee; Seunghoon Hong

Published: 12 Jan 2021; Last Modified: 26 May 2025; ICLR 2021 Poster; Readers: Everyone

Keywords: Video prediction, generative model, long-term prediction

Abstract: Learning to predict the long-term future of video frames is notoriously challenging due to the inherent ambiguities of the distant future and the dramatic amplification of prediction error over time. Despite recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), and extrapolating them to a longer future quickly destroys structure and content. In this work, we revisit hierarchical models for video prediction. Our method generates future frames by first estimating a sequence of dense semantic structures and subsequently translating the estimated structures to pixels with a video-to-video translation model. Despite its simplicity, we show that modeling structures and their dynamics in a categorical structure space with a stochastic sequential estimator leads to surprisingly successful long-term prediction. We evaluate our method on two challenging video prediction scenarios, *car driving* and *human dancing*, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands of frames), setting a new standard of video prediction with orders of magnitude longer prediction time than existing approaches. Video results are available at https://1konny.github.io/HVP/.

One-sentence Summary: We propose a simple yet effective hierarchical video prediction model that can synthesize future frames over horizons orders of magnitude longer than existing methods (thousands of frames).

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Code: [1Konny/HierarchicalVideoPrediction](https://github.com/1Konny/HierarchicalVideoPrediction)

Data: [Cityscapes](https://paperswithcode.com/dataset/cityscapes)

Community Implementations: [6 code implementations](https://www.catalyzex.com/paper/revisiting-hierarchical-approach-for/code)

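The two-stage pipeline described in the abstract (roll out categorical structures first, then translate each structure to pixels) can be illustrated with a toy sketch. This is not the authors' implementation: the names `predict_next_structure`, `translate_to_pixels`, `rollout`, the three-class label set, the palette, and the `flip_prob` parameter are all hypothetical, and the random label flipping and color lookup merely stand in for the learned stochastic sequential estimator and the video-to-video translation model.

```python
import random

NUM_CLASSES = 3  # hypothetical label set, e.g. road / car / sky
# Illustrative colors only; one RGB triple per class.
PALETTE = {0: (128, 64, 128), 1: (0, 0, 142), 2: (70, 130, 180)}


def predict_next_structure(label_map, rng, flip_prob=0.1):
    """Toy stand-in for a stochastic structure estimator.

    Each cell keeps its class with probability 1 - flip_prob and otherwise
    draws a class uniformly at random. A real model would learn this
    categorical distribution from data.
    """
    return [
        [c if rng.random() >= flip_prob else rng.randrange(NUM_CLASSES) for c in row]
        for row in label_map
    ]


def translate_to_pixels(label_map):
    """Toy stand-in for structure-to-pixel translation: one color per class."""
    return [[PALETTE[c] for c in row] for row in label_map]


def rollout(context_map, num_frames, seed=0):
    """Stage 1 rolls out label maps autoregressively; stage 2 renders each one."""
    rng = random.Random(seed)
    structures, current = [], context_map
    for _ in range(num_frames):
        current = predict_next_structure(current, rng)
        structures.append(current)
    frames = [translate_to_pixels(s) for s in structures]
    return structures, frames


# Example: roll out 1000 steps from a 3x3 context label map.
context = [[2, 2, 2], [1, 0, 0], [0, 0, 0]]
structures, frames = rollout(context, num_frames=1000)
```

Because every predicted state in this sketch is a valid class index, the rollout never leaves the categorical structure space no matter how long it runs. The toy only illustrates that property; it makes no claim about the quality or stability of the actual model.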