OV-VIS: Open-Vocabulary Video Instance Segmentation

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · Int. J. Comput. Vis. 2024 · CC BY-SA 4.0
Abstract: Conventionally, the goal of Video Instance Segmentation (VIS) is to segment and categorize objects in videos from a closed set of training categories, which leaves models unable to generalize to novel categories in real-world videos. To address this limitation, we make three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation (OV-VIS), which aims to simultaneously segment, track, and classify objects in videos from an open set of categories, including novel categories unseen during training. Second, to benchmark OV-VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1196 diverse categories, exceeding the category count of existing datasets by more than an order of magnitude. Third, we propose a transformer-based OV-VIS model, OV2Seg+, which associates per-frame segmentation masks with a memory-induced transformer and classifies objects in videos with a language-guided voting module. In addition, to monitor progress, we establish evaluation protocols for OV-VIS and propose a set of strong baseline models to facilitate future work. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg+. The dataset and code are released at https://github.com/haochenheheda/LVVIS, and the competition website is available at https://www.codabench.org/competitions/1748.
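The abstract describes OV2Seg+ only at a high level. As a rough illustration of the classification step, the sketch below shows one way a language-guided voting module could label a tracked instance: per-frame instance embeddings are scored against text embeddings of the candidate category names, and the scores are aggregated over the video. The tensor shapes, the cosine-similarity scoring, the temperature value, and the `vote_classify` helper are assumptions made for this example, not the authors' implementation; see the released code for the actual model.

```python
# Hypothetical sketch of a language-guided voting classifier for a tracked instance.
# Assumptions: CLIP-style embeddings, cosine similarity, and simple averaging
# ("voting") of per-frame class probabilities over the video.
import torch
import torch.nn.functional as F


def vote_classify(frame_query_embeds: torch.Tensor,
                  text_embeds: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """
    frame_query_embeds: (T, D) one instance embedding per frame of the track.
    text_embeds:        (C, D) one embedding per category name, e.g. from a
                        text encoder over prompts like "a photo of a {name}".
    Returns:            (C,) video-level class probabilities for the instance.
    """
    q = F.normalize(frame_query_embeds, dim=-1)        # (T, D) unit-norm queries
    t = F.normalize(text_embeds, dim=-1)                # (C, D) unit-norm text embeds
    per_frame_logits = q @ t.T / temperature            # (T, C) scaled cosine similarities
    per_frame_probs = per_frame_logits.softmax(dim=-1)  # (T, C) per-frame class scores
    return per_frame_probs.mean(dim=0)                  # average the votes over frames


if __name__ == "__main__":
    # Toy usage: 5 frames, 512-d embeddings, 1196 candidate categories (as in LV-VIS).
    scores = vote_classify(torch.randn(5, 512), torch.randn(1196, 512))
    print(scores.argmax().item(), scores.max().item())
```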