Fine-Grained Visual-Language Alignment for Remote Sensing Image-Text Retrieval

Shuo Li, Haoyang Ji, Fang Liu, Licheng Jiao, Xutong Min, Xinyan Huang, Jiahao Wang, Long Sun, Lingling Li, Xu Liu

Published: 2025, Last Modified: 25 Mar 2026. IEEE Trans. Geosci. Remote. Sens. 2025. License: CC BY-SA 4.0
Abstract: Remote sensing image–text retrieval (RSITR) is critical for applications such as environmental monitoring and disaster management. The main challenge in this field is that the multiscale nature of remote sensing images and the semantic differences of professional texts make accurate alignment difficult. Existing coarse-grained methods struggle to address the inherent difference between images and text. To this end, we propose the fine-grained visual-language alignment (FGVLA) method. FGVLA employs a hybrid loss function that combines coarse-grained contrastive and triplet losses with a novel fine-grained loss. The fine-grained loss comprises a spatial mask loss and a fine-grained contrastive loss to enhance semantic alignment. The method also introduces an inference process that works cooperatively with the fine-grained loss to explicitly align image patches with textual nouns. Extensive experiments on the RSICD, RSITMD, and UCM-Caption datasets demonstrate that FGVLA outperforms existing methods in retrieval performance. The code for FGVLA has been released at https://github.com/Ji-Haoyang/FGVLA
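As an illustration of the coarse-grained part of the hybrid loss mentioned in the abstract, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss combined with a bidirectional hinge triplet loss over a batch of matched image and text embeddings. It is not the FGVLA implementation: the function names, the temperature of 0.07, the margin of 0.2, and the `triplet_weight` parameter are all assumptions chosen for illustration, and the fine-grained terms (spatial mask loss and fine-grained contrastive loss) are not sketched because the abstract does not give their formulation.

```python
import math


def cosine_similarity_matrix(image_embs, text_embs):
    """Return S where S[i][j] is the cosine similarity of image i and text j."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair (i, i) is treated as the positive."""
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_sum - logits[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))


def triplet_loss(sim, margin=0.2):
    """Bidirectional hinge triplet loss summed over all in-batch negatives."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # image i against a non-matching text j
            total += max(0.0, margin - sim[i][i] + sim[i][j])
            # text i against a non-matching image j
            total += max(0.0, margin - sim[i][i] + sim[j][i])
    return total / n


def hybrid_coarse_loss(image_embs, text_embs, triplet_weight=1.0):
    """Hypothetical combination: contrastive term plus weighted triplet term."""
    sim = cosine_similarity_matrix(image_embs, text_embs)
    return contrastive_loss(sim) + triplet_weight * triplet_loss(sim)
```

With this sketch, a batch whose image and text embeddings are correctly paired yields a lower loss than the same batch with the captions shuffled, which is the behavior both terms are meant to encourage.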