Abstract: Recent Transformer-based offline Video Instance Segmentation (VIS) studies have shown that localizing information within Transformer layers is more effective than attending to the entire spatio-temporal feature volume. From this observation, we hypothesize that the explicit use of object-centric information from spatial scenes can be a strong basis for understanding the context of an entire sequence. We therefore introduce a new paradigm for offline VIS that learns to integrate decoded object queries from independent frames. Specifically, we propose a simple module that can be easily built on top of an off-the-shelf Transformer-based image instance segmentation model. While the frame-level model distills the rich knowledge of each spatial scene into its object queries, the proposed module associates and identifies the resulting candidate objects by building temporal interactions among them. With a Swin-L backbone, our method achieves 50.7 AP, ranking 3rd in Track 2 (Video Instance Segmentation) of the 4th Large-scale Video Object Segmentation Challenge.
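A minimal sketch of the idea described above: per-frame object queries decoded by a frozen image instance segmentation model are flattened into one token set, a temporal encoder lets queries from different frames interact, and a set of learnable video-level queries aggregates them into clip-wide instance representations. Module names, layer counts, dimensions, and the use of PyTorch's built-in Transformer layers are our own assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ObjectQueryAssociation(nn.Module):
    """Hypothetical sketch of associating per-frame object queries via
    temporal interaction; not the authors' actual implementation."""

    def __init__(self, dim=256, num_heads=8, num_layers=3, num_video_queries=100):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Self-attention over the union of all frame-level object queries,
        # letting potential objects from different frames exchange information.
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Learnable video-level queries that aggregate the associated
        # frame-level queries into clip-wide instance representations.
        self.video_queries = nn.Parameter(torch.randn(num_video_queries, dim))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.video_decoder = nn.TransformerDecoder(decoder_layer, num_layers)

    def forward(self, frame_queries):
        # frame_queries: (B, T, N, C) object queries decoded independently
        # per frame by an off-the-shelf image instance segmentation model.
        B, T, N, C = frame_queries.shape
        tokens = frame_queries.reshape(B, T * N, C)
        tokens = self.temporal_encoder(tokens)  # temporal interactions
        vq = self.video_queries.unsqueeze(0).expand(B, -1, -1)
        # Returns (B, num_video_queries, C) video-level instance queries.
        return self.video_decoder(vq, tokens)
```

Note that the frame-level model is left untouched here; only the association module on top is trained, which matches the abstract's claim that the module can be added to an existing image-level model.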