Visual Contextual Semantic Reasoning for Cross-Modal Drone Image-Text Retrieval

Published: 01 Jan 2024, Last Modified: 30 Jul 2025. IEEE Trans. Geosci. Remote Sens. 2024. License: CC BY-SA 4.0
Abstract: The cross-modal drone image-text (DIT) retrieval task uses either text or drone images as queries to retrieve the corresponding drone images or text. The primary challenge stems from the diverse and intricate nature of drone images, which makes effective alignment between image and text difficult. In response, we propose an innovative approach called visual contextual semantic reasoning (VCSR), aimed at precisely aligning information across modalities. VCSR employs textual cues to guide rich semantic reasoning within the visual context, reducing redundancy in the visual information. Furthermore, the method captures drone image information relevant to the text, revealing subtle correspondences between drone image regions and textual content. To enhance visual semantic learning, a context region learning (CRL) term and a consistency semantic alignment (CSA) term are introduced to provide stronger guidance, further intensifying the cross-modal interaction between textual and visual data and yielding more robust feature representations. Extensive experiments on two self-constructed DIT datasets demonstrate that VCSR outperforms alternative methods on DIT retrieval. The code is available at https://github.com/huangjh98/VCSR.
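To make the abstract's core idea concrete, below is a minimal sketch of text-guided attention over drone image regions, illustrating the kind of cross-modal interaction described (textual cues guiding reasoning over visual context and suppressing redundant regions). All names here (`TextGuidedVisualReasoning`, `text_feats`, `region_feats`, the feature dimension) are hypothetical for illustration; the authors' actual VCSR implementation, including the CRL and CSA terms, is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedVisualReasoning(nn.Module):
    """Aggregates image-region features under guidance from a text query.

    A generic cross-attention sketch, not the authors' VCSR module.
    """
    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects text tokens (queries)
        self.k_proj = nn.Linear(dim, dim)  # projects image regions (keys)
        self.v_proj = nn.Linear(dim, dim)  # projects image regions (values)
        self.scale = dim ** -0.5

    def forward(self, text_feats: torch.Tensor, region_feats: torch.Tensor):
        # text_feats:   (B, Lt, D) token embeddings of the caption
        # region_feats: (B, Lr, D) features of drone image regions
        q = self.q_proj(text_feats)
        k = self.k_proj(region_feats)
        v = self.v_proj(region_feats)
        # Each word attends to the regions relevant to it, which
        # down-weights (reduces redundancy in) irrelevant visual content.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        attended = attn @ v  # (B, Lt, D) text-grounded visual context
        return attended, attn

if __name__ == "__main__":
    # Usage: score an image-caption pair by cosine similarity of pooled features.
    model = TextGuidedVisualReasoning(dim=512)
    text = torch.randn(2, 12, 512)     # 12 caption tokens
    regions = torch.randn(2, 49, 512)  # e.g., a 7x7 grid of region features
    attended, attn = model(text, regions)
    sim = F.cosine_similarity(attended.mean(1), text.mean(1), dim=-1)
    print(sim.shape)  # torch.Size([2])
```

In a retrieval setting, similarities like `sim` would typically be trained with a ranking or contrastive loss so matched image-text pairs score higher than mismatched ones; the abstract's CRL and CSA terms add further guidance on top of such an alignment objective.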