Abstract: Video prediction is a challenging task: for any given input, countless equally plausible future frames exist. While recent work has made progress in predicting and generating future video frames, these methods have not attempted to disentangle different features of videos, such as an object's structure and its dynamics. Such a disentanglement would allow one to control these aspects to some extent during prediction, while maintaining the object's intrinsic properties learned as the model's internal representation. In this work, we propose the Ladder Variational Recurrent Neural Network (LVRNN). We take a type of ladder autoencoder shown to be effective for feature disentanglement on images and apply it to the Variational Recurrent Neural Network (VRNN) architecture, which has been used for video prediction. We rely on keypoints extracted from each frame to separate structure from visual features. We then show how different levels of the ladder network learn to disentangle features, and demonstrate that each level can be used to control a different aspect of future frames, such as structure or dynamics. We evaluate our method on the Human3.6M and BAIR robot datasets, showing that it performs hierarchical disentanglement while achieving results comparable to similar methods.
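To make the architecture concrete, the following is an illustrative sketch (not the authors' code) of a single VRNN-style recurrence step with a two-level "ladder" of latent variables, using plain numpy with randomly initialized weights in place of learned parameters; the prior networks and KL terms of the training objective are omitted.

```python
# Simplified, hypothetical sketch of one step of a ladder-VRNN recurrence.
# All dimensions, weight shapes, and the two-level ladder are illustrative
# assumptions; the real model learns these parameters end-to-end.
import numpy as np

rng = np.random.default_rng(0)

D_X, D_H, D_Z = 8, 16, 4  # frame features, recurrent state, latent per level

def linear(d_in, d_out):
    # Random weights stand in for learned linear layers.
    return rng.standard_normal((d_in, d_out)) * 0.1

W_enc = [linear(D_X + D_H, 2 * D_Z) for _ in range(2)]  # posterior per ladder level
W_dec = linear(2 * D_Z + D_H, D_X)                       # decoder over both levels
W_rnn = linear(D_X + 2 * D_Z + D_H, D_H)                 # deterministic recurrence

def gaussian_params(v, W):
    # Split a linear projection into mean and log-variance.
    mu, log_var = np.split(v @ W, 2, axis=-1)
    return mu, log_var

def vrnn_step(x_t, h_prev):
    """Sample a latent at each ladder level, decode a frame, update the state."""
    zs = []
    for lvl in range(2):
        # Approximate posterior conditioned on the frame and recurrent state.
        mu, log_var = gaussian_params(np.concatenate([x_t, h_prev]), W_enc[lvl])
        zs.append(mu + np.exp(0.5 * log_var) * rng.standard_normal(D_Z))
    z_all = np.concatenate(zs)
    x_hat = np.tanh(np.concatenate([z_all, h_prev]) @ W_dec)     # reconstructed frame
    h_t = np.tanh(np.concatenate([x_t, z_all, h_prev]) @ W_rnn)  # state update
    return x_hat, h_t

x = rng.standard_normal(D_X)
h = np.zeros(D_H)
x_hat, h = vrnn_step(x, h)
print(x_hat.shape, h.shape)  # (8,) (16,)
```

In the full model, each ladder level would capture a different factor (e.g. keypoint structure vs. appearance), so that sampling or fixing one level controls that aspect of the predicted frames.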