Keywords: few-shot video segmentation, video object segmentation, few-shot learning, actor/action segmentation
Abstract: Learning to compare support and query feature sets for few-shot image and video understanding has been shown to be a powerful approach. Typically, methods limit feature comparisons to a single feature layer and thus ignore potentially valuable information. In particular, comparators that operate on early network layer features support precise localization but lack sufficient semantic abstraction. At the other extreme, operating on deeper layer features provides richer descriptors but sacrifices localization. In this paper, we address this scale selection challenge with a meta-learned Multiscale Multigrid Comparator (MMC) transformer that combines information across scales. The multiscale, multigrid operations encompassed by our architecture provide bidirectional information transfer between deep and shallow features (i.e. coarse-to-fine and fine-to-coarse). Thus, the overall comparisons between query and support features benefit from both rich semantics and precise localization. Additionally, we present a novel multiscale memory in the decoder, learned within a meta-learning framework. This augmented memory preserves detailed feature maps during the information exchange across scales and reduces confusion between the background and the novel class. To demonstrate the efficacy of our approach, we consider two related tasks: few-shot video object segmentation and few-shot actor/action segmentation. Empirically, our model outperforms state-of-the-art approaches on both tasks.
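The sketch below is a minimal illustration, not the authors' released implementation, of the compare-then-fuse idea described in the abstract: query features are compared against support features independently at a fine (shallow) and a coarse (deep) scale via cross-attention, and the two scales then exchange information bidirectionally through joint self-attention. All module names, dimensions, and grid sizes are assumptions made for the example.

```python
# Illustrative sketch only: a two-scale query/support comparator with
# bidirectional (coarse-to-fine and fine-to-coarse) information exchange.
# Names and shapes are hypothetical, not the paper's actual architecture.
import torch
import torch.nn as nn


class CrossScaleComparator(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Cross-attention: query-frame features attend to support features.
        self.compare = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention over both scales to exchange information across them.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, q_fine, q_coarse, s_fine, s_coarse):
        # q_*: query features, s_*: support features, each of shape (B, N, C)
        # after flattening the spatial grid of the corresponding scale.
        # 1) Compare query to support independently at each scale.
        f, _ = self.compare(q_fine, s_fine, s_fine)
        c, _ = self.compare(q_coarse, s_coarse, s_coarse)
        f = self.norm1(q_fine + f)
        c = self.norm1(q_coarse + c)
        # 2) Bidirectional cross-scale fusion: concatenate both grids and let
        #    self-attention pass information coarse-to-fine and fine-to-coarse.
        tokens = torch.cat([f, c], dim=1)
        fused, _ = self.fuse(tokens, tokens, tokens)
        tokens = self.norm2(tokens + fused)
        return tokens[:, : f.size(1)], tokens[:, f.size(1):]


if __name__ == "__main__":
    B, C = 2, 256
    q_fine = torch.randn(B, 32 * 32, C)    # shallow, high-resolution grid
    q_coarse = torch.randn(B, 16 * 16, C)  # deep, low-resolution grid
    s_fine = torch.randn(B, 32 * 32, C)
    s_coarse = torch.randn(B, 16 * 16, C)
    out_fine, out_coarse = CrossScaleComparator()(q_fine, q_coarse, s_fine, s_coarse)
    print(out_fine.shape, out_coarse.shape)  # (2, 1024, 256) (2, 256, 256)
```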
TL;DR: We propose the first multiscale multigrid comparator transformer for few-shot video dense prediction tasks
Supplementary Material: zip