Click&Describe: Multimodal Grounding and Tracking for Aerial Objects

Published: 01 Jan 2025, Last Modified: 06 Nov 2025 · WACV 2025 · CC BY-SA 4.0
Abstract: The fusion of multiple modalities, such as vision and language, has led to significant progress in grounding and tracking tasks. However, this success has not yet translated to aerial single-object tracking (SOT) due to the lack of text annotations in existing aerial SOT datasets. To overcome this limitation, we provide text annotations for five existing aerial datasets, designed to support and promote multimodal research in the aerial tracking domain. Furthermore, to address challenges such as small object dimensions, similar-looking objects, and target size fluctuations, we introduce a third input modality: click (or point prompt). We seamlessly integrate click and language information in the model's input to offer a user-friendly, interactive alternative to precise bounding-box annotations, enabling approximate target specification with reduced effort and time. We introduce CLaVi, a novel multimodal framework that redefines input interaction by incorporating multiple modalities. This integration improves target localization and tracking efficiency, significantly advancing the way input is provided to the model. Furthermore, we conduct experiments on the five annotated datasets, consolidated into the AerTrack-460 benchmark, to validate the effectiveness of our approach. On AerTrack-460, our method achieves competitive performance and, in some cases, outperforms previous language-based grounding and tracking techniques, setting a strong baseline for future research. Code and data will be made available soon.