Thread the Needle: Cues-Driven Multiassociation for Remote Sensing Cross-Modal Retrieval

Published: 01 Jan 2024, Last Modified: 30 Jul 2025. IEEE Trans. Geosci. Remote Sens., 2024. License: CC BY-SA 4.0
Abstract: Rapid advances in Earth observation technologies have yielded large volumes of remotely sensed images and corresponding text data, enabling cross-modal image–text retrieval to extract valuable clues. However, current methods often focus on learning global semantic information from text and remote sensing (RS) images while neglecting fine-grained semantic alignment and correlation. In addition, contrastive learning between modalities is often insufficient. To address these issues, we propose a cues-driven multiassociation feature matching network (CDMAN) for cross-modal RS image retrieval. The proposed method involves two key steps: 1) aligning positive samples and enhancing fusion for negative samples based on modal cues. To achieve precise alignment between RS images and text and to facilitate learning from negative samples in contrastive learning, we develop a fine-grained cues injection module that aligns and guides the modalities using fine-grained cues; and 2) establishing multigranularity associative learning. To address the insufficient association between RS images and text, we implement multigranularity collaborative associative learning that targets both general and fine-grained modal associations. By fully leveraging modal cues, our method preserves detailed associations while maintaining overall consistency in global associations. Experiments demonstrate that, compared with baseline methods, this approach achieves more accurate cross-modal retrieval by combining fine-grained alignment and multigranularity associations.
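The abstract does not give implementation details, but the two ingredients it names, a global image–text contrastive objective and a fine-grained (region–word) association score, can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' CDMAN code: the symmetric InfoNCE loss and the max-over-regions, mean-over-words similarity are standard stand-ins for the "global" and "fine-grained" associations described above, and all function names and tensor shapes here are hypothetical.

```python
import torch
import torch.nn.functional as F


def global_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized global image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # image-to-text and text-to-image cross-entropy, averaged
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def fine_grained_similarity(region_feats, word_feats, word_mask):
    """Max-over-regions, mean-over-words score, a common fine-grained matching proxy."""
    region_feats = F.normalize(region_feats, dim=-1)          # (B, R, D)
    word_feats = F.normalize(word_feats, dim=-1)              # (B, W, D)
    sim = torch.einsum('bwd,brd->bwr', word_feats, region_feats)  # (B, W, R)
    best = sim.max(dim=-1).values * word_mask                  # best region per word, padding masked
    return best.sum(dim=-1) / word_mask.sum(dim=-1).clamp(min=1)


# Toy usage with random features (batch B, R regions, W words, dimension D).
B, R, W, D = 4, 36, 20, 256
img_global, txt_global = torch.randn(B, D), torch.randn(B, D)
regions, words = torch.randn(B, R, D), torch.randn(B, W, D)
mask = torch.ones(B, W)

loss = global_contrastive_loss(img_global, txt_global)
loss = loss + (1.0 - fine_grained_similarity(regions, words, mask)).mean()
print(loss.item())
```

In a setup like this, the global term enforces overall image–text consistency across the batch, while the fine-grained term rewards each word finding a well-matched image region; how CDMAN injects cues into these terms and fuses negative samples is specific to the paper.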