- Abstract: Temporal observations such as videos contain essential information about the dynamics of the underlying scene, but they are often interleaved with inessential, predictable details. One way of dealing with this problem is to focus on the most informative moments in a sequence. In this paper, we propose a model that learns to discover these important events, along with the times at which they occur, and to use them to represent the full sequence. We do so with a hierarchical Keyframe-Inpainter (KEYIN) model that first generates a video’s keyframes and then inpaints the rest by generating the frames at the intervening times. We propose a fully differentiable formulation for learning this procedure efficiently. We show that KEYIN finds informative keyframes on several datasets with diverse dynamics and visual properties, and that it outperforms other recent hierarchical predictive models when used for planning. For more details, please see the accompanying arXiv report and the project website.
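The two-stage structure described above can be sketched in miniature. This is an illustrative toy only, not the paper's implementation: KEYIN learns both where the keyframes fall and how to inpaint between them with neural networks, whereas here the keyframe times are fixed by hand and "inpainting" is plain linear interpolation (both are assumptions made for the sketch).

```python
import numpy as np

def keyframe_then_inpaint(sequence, keyframe_times):
    """Toy two-stage prediction: (1) keep only the keyframes,
    (2) inpaint the frames at the intervening times."""
    keyframes = sequence[keyframe_times]      # stage 1: select keyframes
    all_times = np.arange(len(sequence))
    # stage 2: fill in the intervening frames; linear interpolation is a
    # stand-in for the learned inpainting model in the real method
    return np.interp(all_times, keyframe_times, keyframes)

seq = np.array([0.0, 1.0, 2.0, 3.0, 4.0])     # toy 1-D "video"
recon = keyframe_then_inpaint(seq, np.array([0, 2, 4]))
```

Because the toy sequence varies linearly, three keyframes suffice to reconstruct it exactly; in the real model the inpainter must capture the (generally nonlinear) dynamics between keyframes.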