MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts

Published: 25 Sept 2024, Last Modified: 06 Nov 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Object Tracking; Visual-Language Multimodality; Adaptive Prompts
TL;DR: Drawing from the Complementary Learning Systems theory, we propose a novel vision-language tracker (MemVLT), which can provide adaptive multimodal prompts for tracking guidance, achieving SOTA performance on 4 mainstream datasets..
Abstract:

Vision-language tracking (VLT) enhances traditional visual object tracking by integrating language descriptions, requiring the tracker to flexibly understand complex and diverse text in addition to visual information. However, most existing vision-language trackers still overly rely on initial fixed multimodal prompts, which struggle to provide effective guidance for dynamically changing targets. Fortunately, the Complementary Learning Systems (CLS) theory suggests that the human memory system can dynamically store and utilize multimodal perceptual information, thereby adapting to new scenarios. Inspired by this, (i) we propose a Memory-based Vision-Language Tracker (MemVLT). By incorporating memory modeling to adjust static prompts, our approach can provide adaptive prompts for tracking guidance. (ii) Specifically, the memory storage and memory interaction modules are designed in accordance with CLS theory. These modules facilitate the storage and flexible interaction between short-term and long-term memories, generating prompts that adapt to target variations. (iii) Finally, we conduct extensive experiments on mainstream VLT datasets (e.g., MGIT, TNL2K, LaSOT and LaSOT$_{ext}$). Experimental results show that MemVLT achieves new state-of-the-art performance. Impressively, it achieves 69.4% AUC on the MGIT and 63.3% AUC on the TNL2K, improving the existing best result by 8.4% and 4.7%, respectively.

Supplementary Material: zip
Primary Area: Machine vision
Submission Number: 6028
Loading