HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones

Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li

Published: 27 Oct 2025, Last Modified: 06 Nov 2025, License: CC BY-SA 4.0

Abstract: Natural Language-Guided Drones (NLGD) offer a novel and flexible interaction paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantic relationships inherent in drone scenarios place greater demands on vision-language understanding. First, mainstream Vision-Language Models (VLMs) primarily focus on global feature alignment and lack fine-grained semantic understanding. Second, existing hierarchical semantic modeling methods rely on precise entity partitioning and strict containment relationship constraints, which limits their effectiveness in complex drone environments. To address these challenges, we propose the Hierarchical Cross-Granularity Contrastive and Matching Learning (HCCM) framework, comprising two core components: 1) Region-Global Image-Text Contrastive Learning (RG-ITC). Avoiding precise scene entity partitioning, RG-ITC models hierarchical local-to-global cross-modal semantics by contrasting local visual regions with global text semantics, and vice versa. 2) Region-Global Image-Text Matching Learning (RG-ITM). Instead of relying on strict relationship constraints, this component evaluates local semantic consistency within global cross-modal representations, improving the comprehension of complex compositional semantics. Furthermore, textual descriptions in drone scenarios are often incomplete or ambiguous, which destabilizes global semantic alignment. To mitigate this, HCCM incorporates a Momentum Contrast and Momentum Distillation (MCD) mechanism that enhances alignment robustness. Extensive experiments on the GeoText-1652 benchmark demonstrate that HCCM significantly outperforms existing methods, achieving state-of-the-art Recall@1 scores of 28.8% (image retrieval) and 14.7% (text retrieval). Moreover, HCCM exhibits strong zero-shot generalization on the unseen ERA dataset, achieving 39.93% mean recall (mR) and surpassing the fine-tuned models evaluated. These results highlight the effectiveness and robustness of HCCM across diverse scenarios. Our implementation is available at https://github.com/rhao-hur/HCCM.
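
The region-global contrast described for RG-ITC can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a standard InfoNCE loss with in-batch negatives, one local feature per sample, and a temperature of 0.07; the function names `info_nce` and `region_global_contrastive` are hypothetical, chosen only for this illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss: queries[i] should match keys[i]; all other keys act as negatives.

    The temperature value is an assumption, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all candidate keys.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def region_global_contrastive(region_img, global_txt, region_txt, global_img):
    """Hypothetical region-global objective, averaging two directions:
    local image regions vs. global text, and local text vs. global image."""
    return 0.5 * (info_nce(region_img, global_txt)
                  + info_nce(region_txt, global_img))


# Toy 2-D features: sample 0 lies along the x-axis, sample 1 along the y-axis.
img_regions = [[1.0, 0.0], [0.0, 1.0]]
txt_globals = [[0.9, 0.1], [0.1, 0.9]]
txt_regions = [[1.0, 0.0], [0.0, 1.0]]
img_globals = [[0.8, 0.2], [0.2, 0.8]]
loss = region_global_contrastive(img_regions, txt_globals, txt_regions, img_globals)
```

In this toy setup, correctly paired features yield a loss close to zero, while swapping the order of one side makes the loss grow sharply, which is the behavior a contrastive alignment objective is meant to have.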

External IDs: doi:10.1145/3746027.3755489