OV-VOD: Open-Vocabulary Video Object Detection

Published: 26 Oct 2025, Last Modified: 12 Nov 2025 · ACMMM 2025 · CC BY 4.0
Abstract: Traditional Video Object Detection (VOD) is limited to pre-defined closed-set categories, restricting its ability to detect novel objects in real-world scenarios. To address this limitation, we make three key contributions. First, we formally define Open-Vocabulary Video Object Detection (Open-Vocabulary VOD) as the task of detecting objects in video streams from open-set categories, including novel categories unseen during training. Second, we establish an evaluation benchmark by adapting existing datasets (LV-VIS, BURST, and TAO) to bridge the data gap for this new task. Third, we propose OV-VOD, an Open-Vocabulary VOD method that detects objects in videos beyond pre-defined training categories and addresses a key shortcoming of image-level open-vocabulary detectors, which generally neglect essential temporal and spatial information. Specifically, we design a Semantic-Presence Memory Tracking (SPMT) module that propagates object features across frames through a memory bank to exploit temporal consistency. Moreover, we propose a Spatial Object Relationship Distillation loss ($\mathcal L_{SR}$) that captures inter-object spatial dependencies and enhances knowledge transfer during feature distillation. Experiments on multiple video datasets demonstrate that OV-VOD exhibits superior zero-shot generalization compared to existing image-level open-vocabulary object detection methods.
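The abstract does not give the exact formulations of SPMT or $\mathcal L_{SR}$, so the following is only an illustrative sketch of the two general techniques it names: a per-object memory bank that propagates features across frames, and a relational distillation loss that matches pairwise inter-object relations between student and teacher features. The class name `SemanticPresenceMemory`, the exponential-moving-average update, the `momentum` parameter, and the cosine-similarity relation matrix are all assumptions for illustration, not the paper's actual design.

```python
import numpy as np

class SemanticPresenceMemory:
    """Hypothetical sketch of an SPMT-style memory bank: each tracked object's
    feature is smoothed across frames with an exponential moving average
    (the EMA update is an assumption; the paper's actual rule is not given)."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.bank = {}  # object id -> (D,) feature vector

    def update(self, obj_id, feat):
        # Blend the stored feature with the current frame's feature.
        if obj_id in self.bank:
            self.bank[obj_id] = (self.momentum * self.bank[obj_id]
                                 + (1.0 - self.momentum) * feat)
        else:
            self.bank[obj_id] = feat.copy()
        return self.bank[obj_id]

def pairwise_relations(feats):
    """Cosine-similarity matrix (N, N) over N object embeddings (N, D)."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

def spatial_relation_distill_loss(student_feats, teacher_feats):
    """Illustrative relational-distillation loss: MSE between the student's
    and teacher's inter-object relation matrices (one common choice; not
    necessarily the paper's L_SR)."""
    r_s = pairwise_relations(student_feats)
    r_t = pairwise_relations(teacher_feats)
    return float(np.mean((r_s - r_t) ** 2))
```

A relation-matrix loss of this kind transfers the *structure* among detected objects rather than matching features one by one, which is why it can capture inter-object dependencies that per-object distillation misses.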