VIM: Variational Independent Modules for Video Prediction

26 Oct 2021, 15:20 (modified: 20 Feb 2022, 21:47) | CLeaR 2022 Poster | Readers: Everyone
Keywords: Objects, Modularity, Unsupervised Representation Learning, Videos, Interpretability, Compositionality
TL;DR: We define an object-centric video prediction model that learns modular object dynamics and generalizes compositionally to out-of-distribution observations.
Abstract: We introduce VIM (Variational Independent Modules), a variational inference model for sequential data that learns and infers latent representations as a set of objects and discovers modular causal mechanisms over these objects. These mechanisms, which we call modules, are independently parametrized, define the stochastic transitions of entities, and are shared across entities. At each time step, our model infers, from a low-level input sequence, a high-level sequence of categorical latent variables that select which transition module to apply to which high-level object. We evaluate this model on video prediction tasks where the goal is to predict multi-modal future events given previous observations. We demonstrate empirically that VIM can model 2D visual sequences in an interpretable way and can identify the underlying, dynamically instantiated mechanisms of the generation process. We additionally show that the learnt modules can be composed at test time to generalize to out-of-distribution observations.
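The selection mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`sample_categorical`, `make_module`, `step`), the additive-Gaussian form of each module, and the hand-set selection logits are all assumptions made for illustration. In the model itself, the modules are learnt and the categorical variables are inferred from the input sequence.

```python
import math
import random

def sample_categorical(logits, rng):
    """Sample an index from softmax(logits)."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def make_module(shift, noise_std):
    """Toy stochastic transition (assumed form): state + shift + Gaussian noise."""
    def transition(state, rng):
        return [s + d + rng.gauss(0.0, noise_std) for s, d in zip(state, shift)]
    return transition

def step(objects, selection_logits, modules, rng):
    """Advance every object by one time step.

    objects: list of per-object state vectors.
    selection_logits: one list of logits per object, one logit per module.
    Returns the new states and the sampled module index for each object.
    """
    # One categorical latent variable per object picks its transition module.
    choices = [sample_categorical(logits, rng) for logits in selection_logits]
    # The same pool of modules is shared across all objects.
    new_objects = [modules[k](obj, rng) for obj, k in zip(objects, choices)]
    return new_objects, choices

rng = random.Random(0)
modules = [
    make_module(shift=[1.0, 0.0], noise_std=0.01),   # hypothetical "move right"
    make_module(shift=[0.0, -1.0], noise_std=0.01),  # hypothetical "move down"
    make_module(shift=[0.0, 0.0], noise_std=0.01),   # hypothetical "stay"
]
objects = [[0.0, 0.0], [5.0, 5.0]]
# Hand-set logits standing in for what the model would infer from the input.
selection_logits = [[10.0, -10.0, -10.0], [-10.0, 10.0, -10.0]]
new_objects, choices = step(objects, selection_logits, modules, rng)
```

Because each module is a separate function applied per object, modules can be recombined freely across objects at test time, which is the intuition behind the compositional generalization the abstract reports.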