Self-Supervision Interactive Alignment for Remote Sensing Image-Audio Retrieval

Published: 01 Jan 2023, Last Modified: 30 Jul 2025. IEEE Trans. Geosci. Remote. Sens. 2023. License: CC BY-SA 4.0
Abstract: Cross-modal remote sensing image–audio (RSIA) retrieval aims to use audio or remote sensing images (RSIs) as queries to retrieve relevant RSIs or corresponding audios. Although many approaches leverage labeled samples to achieve good performance, obtaining labeled samples is costly, because labeling cross-modal remote sensing (RS) samples usually requires substantial manual labor. Therefore, unsupervised cross-modal learning is very important in real-world applications. In this article, we propose a novel unsupervised cross-modal RSIA retrieval approach, named self-supervision interactive alignment (SSIA), which can take advantage of large amounts of unlabeled samples to learn the salient information, the cross-modal alignment, and the similarity between RSIs and audios. Since self-supervised learning lacks label supervision, we leverage the similarity between the input RSI information and audio information as the supervision signal. In addition, to perform cross-modal alignment, a novel interactive alignment (IA) module is designed to explore the fine-grained correspondence between RSIs and audios. Moreover, we design an audio-guided image de-redundant module that reduces redundant visual information and thereby captures the salient information of RSIs. Extensive experiments on four widely used RSIA datasets demonstrate that SSIA achieves better RSIA retrieval performance than the compared approaches.
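The abstract does not spell out the loss or the retrieval procedure, so the following is only a minimal sketch of one common way to use image-audio similarity as a supervision signal: a symmetric InfoNCE-style contrastive loss over paired embeddings, followed by ranking candidates by cosine similarity. The contrastive loss, the temperature value, the function names, and the toy embeddings are all assumptions made for illustration; they are not taken from the SSIA paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def similarity_matrix(image_embs, audio_embs):
    """Cosine similarity between every image embedding and every audio embedding."""
    imgs = [l2_normalize(v) for v in image_embs]
    auds = [l2_normalize(v) for v in audio_embs]
    return [[sum(a * b for a, b in zip(i, u)) for u in auds] for i in imgs]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def symmetric_contrastive_loss(sim, temperature=0.07):
    """Assumed InfoNCE-style loss: pair (k, k) is positive, all other pairs negative."""
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-audio and audio-to-image directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))

def retrieve(sim_row, top_k=1):
    """Indices of the top_k most similar candidates for one query."""
    return sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)[:top_k]

# Toy embeddings (made up for illustration): image k is paired with audio k.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.2, 0.8]]

sim = similarity_matrix(image_embs, audio_embs)
loss = symmetric_contrastive_loss(sim)
ranked = retrieve(sim[0], top_k=2)  # audio candidates ranked for image query 0
```

In this sketch no labels are needed: the only supervision is that the RSI and the audio at the same index belong together, so the loss is low when each item is most similar to its own partner and high when the pairing is mismatched.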