Click & Describe: Multimodal Grounding and Tracking for Aerial Objects

Published: 28 Oct 2024, Last Modified: 14 Jan 2025
Venue: Video-Language Models Poster
License: CC BY 4.0
Track: Short Paper Track (up to 3 pages)
Keywords: aerial grounding and tracking, multimodal, aerial dataset
Abstract: Fusing multiple modalities, such as vision and language, has led to significant progress in grounding and tracking tasks. However, this success has not yet carried over to aerial single-object tracking (SOT), because existing aerial SOT datasets lack text annotations. To overcome this limitation, we provide text annotations for five existing aerial datasets to support and promote multimodal research in the aerial tracking domain. Furthermore, to address challenges such as small object dimensions, similar-looking objects, and target size fluctuations, we introduce a third input modality: click (or point prompt). Click and language information are integrated in the model's input as a user-friendly, interactive alternative to precise bounding box annotations, enabling approximate target specification with reduced effort and time. We introduce CLaVi, a novel multimodal framework that redefines input interaction by incorporating multiple modalities, which improves target localization and tracking efficiency. We further conduct experiments on the five datasets to establish the AerTrack-460 benchmark and validate the effectiveness of our approach. On AerTrack-460, our approach shows competitive performance and, in some cases, outperforms previous language-based grounding and tracking techniques, setting a strong baseline for future research.
Submission Number: 47