Abstract: The ability to perceive scenes in terms of abstract entities is crucial for us to
achieve higher-level intelligence. Recently, several methods have been proposed
to learn object-centric representations of scenes with multiple objects, yet most
of them focus on static scenes. In this paper, we work on object dynamics and
propose the Object Dynamics Distillation Network (ODDN), a framework that distills explicit object dynamics (e.g., velocity) from sequential static representations. ODDN also builds a relation module to model object interactions. We verify
our approach on tasks of video reasoning and video prediction, which are two important evaluations for video understanding. The results show that the reasoning
model with visual representations of ODDN performs better than previous state-of-the-art methods in answering reasoning questions about physical events in a video. The distilled object dynamics can also be used to predict
future video frames given two input frames, including cases that involve occlusion and object collisions. In addition, our architecture brings better segmentation quality and higher
reconstruction accuracy.