Keywords: video segmentation; VOS; referring expression segmentation
TL;DR: A promptable video model for joint video object segmentation and object captioning
Abstract: Understanding objects in videos in terms of fine-grained localization masks and
detailed semantic properties is a fundamental task in video understanding. In this
paper, we propose VoCap, a flexible video model that consumes a video and a
prompt of various modalities (text, box or mask), and produces a spatio-temporal
masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation
dataset (SAV) with pseudo object captions. We do so by preprocessing videos with
their ground-truth masks to highlight the object of interest and feeding this to a large
Vision Language Model (VLM). For an unbiased evaluation, we collect manual
annotations on the validation set. We call the resulting dataset SAV-Caption. We
train our VoCap model at scale on SAV-Caption together with a mix of other
image and video datasets. Our model yields state-of-the-art results on referring
expression video object segmentation, is competitive on semi-supervised video
object segmentation, and establishes a benchmark for video object captioning. Our
dataset will be made available.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16843