- TL;DR: We introduce a new spatio-temporal embedding loss on videos that generates temporally consistent video instance segmentation, even with occlusions and missed detections, using appearance, geometry, and temporal context.
- Abstract: Understanding object motion is one of the core problems in computer vision. It requires segmenting and tracking objects over time. Significant progress has been made in instance segmentation, but such models cannot track objects, and more crucially, they are unable to reason in both 3D space and time. We propose a new spatio-temporal embedding loss on videos that generates temporally consistent video instance segmentation. Our model includes a temporal network that learns to model temporal context and motion, which is essential to produce smooth embeddings over time. Further, our model also estimates monocular depth with a self-supervised loss, as the relative distance to an object effectively constrains where it can be next, ensuring a time-consistent embedding. Finally, we show that our model can accurately track and segment instances, even with occlusions and missed detections, advancing the state of the art on the KITTI Multi-Object Tracking and Segmentation dataset.
- Code: https://github.com/iclr-2020-embedding/spatio-temporal-embedding
- Keywords: computer vision, video instance segmentation, metric learning
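The abstract's spatio-temporal embedding loss is not spelled out here, but losses of this kind commonly combine a pull term (pixel embeddings attracted to their instance centroid) with a push term (centroids of different instances repelled beyond a margin). The sketch below is a minimal, assumed form of such a loss; the function name, margins, and hinge formulation are illustrative, not the paper's exact formulation.

```python
import numpy as np

def embedding_loss(embeddings, instance_ids, margin_pull=0.5, margin_push=1.5):
    """Hypothetical pull/push instance-embedding loss (assumed form,
    not the paper's exact loss). embeddings: (N, D) pixel embeddings,
    instance_ids: (N,) integer instance labels."""
    ids = np.unique(instance_ids)
    centroids = np.stack(
        [embeddings[instance_ids == i].mean(axis=0) for i in ids]
    )

    # Pull term: hinged distance of each pixel embedding to its
    # instance centroid; tight clusters incur no cost.
    pull = 0.0
    for k, i in enumerate(ids):
        dist = np.linalg.norm(
            embeddings[instance_ids == i] - centroids[k], axis=1
        )
        pull += np.mean(np.maximum(dist - margin_pull, 0.0) ** 2)
    pull /= len(ids)

    # Push term: penalize centroid pairs closer than the push margin,
    # averaged over all instance pairs.
    push = 0.0
    if len(ids) > 1:
        n_pairs = len(ids) * (len(ids) - 1) / 2
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                dist = np.linalg.norm(centroids[a] - centroids[b])
                push += np.maximum(margin_push - dist, 0.0) ** 2
        push /= n_pairs

    return pull + push
```

At inference, instances would then be recovered by clustering pixel embeddings (e.g. mean shift around centroids); the paper's contribution is making these embeddings consistent across frames via temporal context and depth.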