Keywords: Aerial Video, Spatial Video Grounding, Multi-modality Spatio-temporal Interaction, Hierarchical Progressive Decoder
Abstract: The task of localizing an object's spatial tube from a language instruction and a video, known as spatial video grounding (SVG), has attracted widespread interest. Existing SVG benchmarks focus on egocentric, fixed front-facing perspectives and simple scenes, covering only a limited range of views and environments. UAV-based SVG, in contrast, remains underexplored, leaving the inherent disparities introduced by drone movement and the difficulty of localizing objects from the air unaddressed. To facilitate research in this direction, we introduce the novel spatial aerial video grounding (SAVG) task. Specifically, we meticulously construct a large-scale benchmark, UAV-SVG, which contains over 2 million frames and covers 216 highly diverse target categories. To address the disparities and challenges posed by complex aerial environments, we propose a new end-to-end transformer architecture, coined SAVG-DETR. Its innovations are three-fold. 1) To avoid the computational explosion of self-attention when introducing multi-scale features, our encoder efficiently decouples multi-modality, multi-scale spatio-temporal modeling into intra-scale multi-modality interaction and cross-scale visual-only fusion. 2) To strengthen small-object grounding, we propose a language modulation module that integrates multi-scale information into the language features, together with a multi-level progressive spatial decoder that decodes from high-level to low-level features, gradually increasing the number of decoding stages devoted to the lower-level vision-language features. 3) To improve prediction consistency across frames, we design a decoding paradigm based on offset generation: at each decoding stage, reference anchors constrain the grounding region, context-rich object queries predict offsets, and the reference anchors are updated for the next stage. From coarse to fine, SAVG-DETR gradually bridges the modality gap and iteratively refines the reference anchors of the referred object, eventually grounding its spatial tube.
Extensive experiments demonstrate that our SAVG-DETR significantly outperforms existing state-of-the-art methods. The dataset and code will be made available here.
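The offset-generation decoding paradigm described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name, feature dimension, attention configuration, and the inverse-sigmoid anchor update are all assumptions, shown only to make the anchor-constrain / offset-predict / anchor-update loop concrete.

```python
import torch
import torch.nn as nn


class ProgressiveAnchorDecoderStage(nn.Module):
    """One hypothetical decoding stage of an offset-based paradigm:
    object queries attend to fused vision-language features and predict
    box offsets that refine the reference anchors for the next stage.
    All names and sizes here are illustrative, not the paper's."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.offset_head = nn.Linear(dim, 4)  # (dx, dy, dw, dh) offsets

    def forward(self, queries, memory, anchors):
        # queries: (B, Q, dim) context-rich object queries
        # memory:  (B, N, dim) fused vision-language features
        # anchors: (B, Q, 4)   reference boxes (cx, cy, w, h) in [0, 1]
        attended, _ = self.cross_attn(queries, memory, memory)
        offsets = self.offset_head(attended)
        # Update anchors in inverse-sigmoid (logit) space, a common
        # trick for stable iterative box refinement.
        eps = 1e-5
        a = anchors.clamp(eps, 1 - eps)
        logits = torch.log(a / (1 - a))
        new_anchors = torch.sigmoid(logits + offsets)
        return attended, new_anchors
```

Stacking several such stages, each consuming the anchors produced by the previous one, yields the coarse-to-fine refinement the abstract describes.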
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Flagged For Ethics Review: true
Submission Number: 546