AVLTrack: Dynamic Sparse Learning for Aerial Vision-Language Tracking

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: Introducing natural language into vision-language (VL) tracking has been shown to improve performance. However, natural language remains under-explored in existing aerial trackers. Moreover, existing VL trackers ignore the misalignment between language and dynamic target states, which is prominent in complex UAV scenarios. In this work, we present AVLTrack, a flexible framework for aerial vision-language tracking. It consists of three key components: a dynamic sparse learning (DSL) module, an efficient Transformer backbone, and a multi-level language perception (MLP) strategy. First, DSL sparsely connects language and images via dynamic sparse attention, providing accurate multi-modal prompts. To adapt to target state variations, the sparsity in DSL is dynamically adjusted based on semantic information, flexibly highlighting target-specific tokens. Next, the Transformer backbone follows a highly parallelized one-stream architecture, allowing efficient multi-modal feature extraction and interaction. Finally, MLP enables iterative interaction between language and visual information, aiming to leverage language priors to guide the generation of discriminative visual features. Moreover, we construct the DTB70-NLP dataset to facilitate UAV vision-language tracking. Extensive experiments on WebUAV-3M and DTB70-NLP demonstrate the leading performance of AVLTrack compared to existing state-of-the-art trackers while maintaining a high running speed of 80.5 FPS. The dataset and code are available at https://github.com/xyl-507/AVLTrack.
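
Since the abstract only describes dynamic sparse learning at a high level, the sketch below illustrates one plausible reading of "dynamic sparse attention" over language and visual tokens in PyTorch. It is an assumption-laden illustration, not the authors' released implementation: the module name DynamicSparseAttention, the gate that predicts a keep ratio from pooled language features, and all parameter names are hypothetical; consult the linked repository for the actual code.

```python
# Illustrative sketch only -- not the AVLTrack implementation.
import torch
import torch.nn as nn


class DynamicSparseAttention(nn.Module):
    """Cross-attention from visual tokens to language tokens that keeps only
    the most language-relevant visual positions; the keep ratio is modulated
    by a semantic gate predicted from the pooled language embedding."""

    def __init__(self, dim: int, base_ratio: float = 0.5):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Hypothetical gate: maps pooled language features to a sparsity level in (0, 1).
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.base_ratio = base_ratio
        self.scale = dim ** -0.5

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, Nv, C), lang_tokens: (B, Nl, C)
        q = self.q(vis_tokens)
        k = self.k(lang_tokens)
        v = self.v(lang_tokens)

        attn = (q @ k.transpose(-2, -1)) * self.scale            # (B, Nv, Nl)
        relevance = attn.softmax(dim=-1).max(dim=-1).values      # per-visual-token relevance (B, Nv)

        # Dynamically set the keep ratio from the pooled language semantics.
        ratio = self.base_ratio * self.gate(lang_tokens.mean(dim=1)).squeeze(-1)   # (B,)
        k_keep = (ratio * vis_tokens.size(1)).clamp(min=1).long()

        out = attn.softmax(dim=-1) @ v                           # language-conditioned visual features
        # Sparsify: only the per-sample top-k relevant visual tokens receive the prompt.
        mask = torch.zeros_like(relevance)
        for b in range(vis_tokens.size(0)):
            idx = relevance[b].topk(int(k_keep[b])).indices
            mask[b, idx] = 1.0
        return vis_tokens + mask.unsqueeze(-1) * out             # highlight target-specific tokens


# Example usage with arbitrary shapes:
block = DynamicSparseAttention(dim=256)
vis = torch.randn(2, 196, 256)    # e.g. 14x14 visual tokens
lang = torch.randn(2, 12, 256)    # tokenized language description
fused = block(vis, lang)          # (2, 196, 256)
```

The key design point this sketch tries to convey is that the sparsity level is not fixed: it is predicted from the language semantics, so the set of visual tokens that receive the language prompt can grow or shrink as the description's relevance to the current target state changes.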