- Abstract: To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics , has recently received attention and several methods were proposed to learn these physical rules from video sequences. Yet, most these methods are restricted to the case where no occlusions occur, narrowing the potential areas of application. The main contribution of this paper is a method combining a predictor of object dynamics and a neural renderer efficiently predicting future trajectories and explicitly modelling partial and full occlusions among objects. We present a training procedure enabling learning intuitive physics directly from the input videos containing segmentation masks of objects and their depth. Our results show that our model learns object dynamics despite significant inter-object occlusions, and realistically predicts segmentation masks up to 30 frames in the future. We study model performance for increasing levels of occlusions, and compare results to previous work on the tasks of future prediction and object following. We also show results on predicting motion of objects in real videos and demonstrate significant improvements over state-of-the-art on the object permanence task in the intuitive physics benchmark of Riochet et al. (2018).
- Original Pdf: pdf