GLATrack: Global and Local Awareness for Open-Vocabulary Multiple Object Tracking

Published: 20 Jul 2024, Last Modified: 05 Aug 2024, MM 2024 Poster, CC BY 4.0
Abstract: Open-vocabulary multiple object tracking (MOT) aims to track arbitrary objects encountered in the real world beyond those seen in the training set. However, recent methods rely solely on instance-level detection and association of novel objects, overlooking the valuable fine-grained semantic representations of targets within key and reference frames. In this paper, we propose GLATrack, a Global and Local Awareness open-vocabulary MOT method that tackles real-world MOT from both global and instance-level perspectives. Specifically, we introduce a region-aware feature enhancement module that refines global knowledge to complement local target information, enhancing semantic representation and bridging the distribution gap between the image feature map and the pooled regional features. We further propose a bidirectional semantic complementarity strategy to mitigate the semantic misalignment caused by missing target information in key frames; it dynamically selects valuable information from reference frames to enrich object representations during the knowledge distillation process. In addition, an appearance richness measurement module provides appropriate representations for targets with different appearances. The proposed method achieves improvements of 6.9% in TETA and 5.6% in mAP on the large-scale TAO benchmark.
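Since the abstract describes the modules only at a high level, the following PyTorch-style sketch is included to illustrate the intended data flow of combining refined global features with pooled regional features and selecting complementary information across frames. All class names, tensor shapes, the cross-attention fusion, and the norm-based "richness" score are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the global/local fusion idea described in the abstract.
# Module names, shapes, and the selection rule are assumptions; the paper's
# actual architecture may differ.
import torch
import torch.nn as nn


class RegionAwareEnhancement(nn.Module):
    """Refine global image features and use them to complement pooled RoI features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.global_proj = nn.Conv2d(dim, dim, kernel_size=1)      # align global map to RoI feature space
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_map: torch.Tensor, roi_feats: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) global image features; roi_feats: (B, N, C) pooled regions
        g = self.global_proj(feat_map).flatten(2).transpose(1, 2)        # (B, H*W, C)
        enhanced, _ = self.cross_attn(query=roi_feats, key=g, value=g)   # regions attend to global context
        return self.norm(roi_feats + enhanced)                           # residual fusion


def bidirectional_complement(key_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """Per target, softly favor whichever frame's embedding looks more informative
    (here: larger L2 norm as a stand-in for 'appearance richness')."""
    key_score = key_emb.norm(dim=-1, keepdim=True)
    ref_score = ref_emb.norm(dim=-1, keepdim=True)
    w = torch.sigmoid(ref_score - key_score)          # soft selection weight in (0, 1)
    return (1 - w) * key_emb + w * ref_emb


if __name__ == "__main__":
    enhancer = RegionAwareEnhancement(dim=256)
    feat_map = torch.randn(2, 256, 32, 32)            # global feature map from the key frame
    roi_feats = torch.randn(2, 10, 256)               # pooled features for 10 tracked regions
    fused = enhancer(feat_map, roi_feats)             # (2, 10, 256)
    distill_target = bidirectional_complement(fused, torch.randn(2, 10, 256))
    print(fused.shape, distill_target.shape)
```

In this sketch the enhanced regional features would serve as distillation targets, with the reference-frame embedding substituted in whenever it appears richer than the key-frame one; the actual selection criterion used by GLATrack is not specified in the abstract.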
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Multiple object tracking (MOT) is a fundamental computer vision task that tracks and analyzes the movements of multiple objects across video sequences. ACM MM (ACM International Conference on Multimedia) is a premier venue for advances in multimedia analysis, including MOT algorithms and applications, and it provides a platform for researchers to exchange ideas, present state-of-the-art methods, and foster collaborations, thereby contributing to the progress of MOT research and its integration into real-world multimedia applications.
Submission Number: 4517