Consistencies are All You Need for Semi-supervised Vision-Language Tracking

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: Vision-Language Tracking (VLT) requires locating a specific target in video sequences, given a natural language prompt and an initial object box. Despite recent advancements, existing approaches rely heavily on expensive and time-consuming human annotations. To mitigate this limitation, directly generating pseudo labels from raw videos seems a straightforward solution; however, it inevitably introduces undesirable noise into the training process. Moreover, we argue that an effective tracker should excel at tracking the target regardless of the temporal direction. Building on these insights, we propose the first semi-supervised learning scheme for the VLT task, a crucial step towards reducing the dependency on high-quality yet costly labeled data. Specifically, drawing inspiration from the natural attributes of a video (i.e., space, time, and semantics), our approach progressively leverages the inherent consistencies of these aspects: (1) Spatially, each frame and any object cropped from it naturally form an image-bbox (bounding box) pair for self-training; (2) Temporally, bidirectional tracking trajectories should exhibit minimal differences; (3) Semantically, the correlation between visual and textual features is expected to remain consistent. Furthermore, the framework is validated with a simple yet effective tracker we devised, named ATTracker (Asymmetrical Transformer Tracker). It modifies the self-attention operation in an asymmetrical way, enhancing target-related features while suppressing noise. Extensive experiments confirm that ATTracker serves as a robust baseline, outperforming fully supervised base trackers. By unveiling the potential of learning with limited annotations, this study aims to attract attention and pave the way for Semi-supervised Vision-Language Tracking (SS-VLT).
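The three consistency signals lend themselves to a compact illustration. Below is a minimal PyTorch sketch of how each could be expressed as a training objective; all names (`tracker`, the feature tensors, and the exact loss formulations) are hypothetical illustrations inferred from the abstract, not the authors' released code.

```python
# Hypothetical sketch of the three consistencies from the abstract.
# `tracker` is an assumed callable: tracker(frames, init_box) -> [T, 4] boxes.
import torch
import torch.nn.functional as F

def spatial_self_training_pair(frame, box):
    """Spatial consistency: a frame and any object cropped from it form a
    free image-bbox pair, so the crop's known location is a pseudo label."""
    x1, y1, x2, y2 = box
    template = frame[:, :, y1:y2, x1:x2]  # cropped target patch [B, C, h, w]
    return frame, template, torch.tensor([x1, y1, x2, y2], dtype=torch.float)

def temporal_cycle_loss(tracker, frames, init_box):
    """Temporal consistency: track forward through the clip, then backward
    from the final prediction; the two trajectories should coincide."""
    fwd = tracker(frames, init_box)                  # [T, 4] forward boxes
    bwd = tracker(frames.flip(0), fwd[-1]).flip(0)   # backward, re-aligned
    return F.l1_loss(fwd, bwd)

def semantic_consistency_loss(vis_feat_a, vis_feat_b, txt_feat):
    """Semantic consistency: the vision-language correlation should remain
    stable across views/frames of the same target."""
    sim_a = F.cosine_similarity(vis_feat_a, txt_feat, dim=-1)
    sim_b = F.cosine_similarity(vis_feat_b, txt_feat, dim=-1)
    return F.mse_loss(sim_a, sim_b)
```

The asymmetrical self-attention can likewise be sketched. One plausible reading, assumed here rather than taken from the paper, is a mixed-attention layer in which template (target) tokens attend only to themselves while search tokens attend to both streams:

```python
import torch

class AsymmetricAttention(torch.nn.Module):
    """Illustrative guess at an 'asymmetrical' self-attention, not the
    paper's exact design: template tokens see only the template, so the
    target representation stays clean, while search tokens see both streams."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, template_tokens, search_tokens):
        # Template branch: pure self-attention over target tokens only.
        t, _ = self.attn(template_tokens, template_tokens, template_tokens)
        # Search branch: queries from search, keys/values from both streams.
        kv = torch.cat([template_tokens, search_tokens], dim=1)
        s, _ = self.attn(search_tokens, kv, kv)
        return t, s
```

Under this reading, the asymmetry prevents background clutter in the search region from contaminating the target representation while still conditioning search features on the target, matching the abstract's stated goal of enhancing target-related features and suppressing noise.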
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We propose a robust baseline for Semi-supervised Vision-Language Tracking (SS-VLT) that leverages multimodal learning to enhance tracking performance. By unveiling the potential of learning with limited annotations, this study aims to draw attention to the task and pave the way for further research.
Supplementary Material: zip
Submission Number: 1401