Language-Empowered Conversion for Remote Sensing Image Retrieval With Text Feedback

Jian Yang, Shengyang Li, Yuhan Sun, Han Wang, Zhuang Zhou

Published: 2025, Last Modified: 26 Jan 2026. IEEE Trans. Geosci. Remote. Sens. 2025. License: CC BY-SA 4.0

Abstract: Remote sensing image retrieval with text feedback (RSIR-TF) is a challenging multimodal retrieval task that leverages a reference image, modification text, and scene graph to retrieve the relevant target image from a gallery. Existing approaches rely on cross-modal combiners to integrate multimodal features extracted separately by modality-specific encoders. However, the modality-specific encoders often suffer from insufficient representational capacity and limited cross-modal alignment due to the lack of effective pretraining. Recently, vision language models (VLMs) pretrained on large-scale image-text pairs have demonstrated exceptional representation and alignment capabilities in the remote sensing (RS) domain. However, these VLMs struggle to process structured scene graphs, which limits their applicability to tasks like RSIR-TF that require composite reasoning over structured and unstructured modalities. To address these limitations, we propose a novel pipeline, language-empowered conversion (LEmpo), which effectively migrates the large language model (LLM) and VLM to the RSIR-TF task. First, we perform pseudo caption generation (PCG) and scene graph interpretation (SGI), both powered by an LLM, to convert structured scene graphs into natural language captions. This conversion bridges the gap between structured scene graphs and unstructured text captions, enabling unified feature extraction and alignment. Subsequently, we employ the pretrained VLM to extract robust visual and textual features within a joint visual-textual feature space. To fully utilize the complementary information from visual, textual, and structured data, we introduce a hybrid similarity tuning strategy, which aggregates triplet similarity, language similarity, and pseudo caption similarity into a unified hybrid similarity. The hybrid similarity is optimized during training through vision-fixed tuning, which anchors visual features while refining textual features to enhance alignment with target images. Comprehensive experiments conducted on the airplane, tennis, and WHIRT datasets demonstrate that LEmpo significantly outperforms all comparison methods, achieving a substantial improvement in recall performance.
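
The aggregation of three similarities into one hybrid score can be illustrated with a minimal sketch. The abstract does not specify how the three terms are defined or combined, so everything here is an assumption made for illustration: each term is taken to be a cosine similarity between a query-side feature and the target image feature, the aggregation is taken to be a weighted sum, and the names `hybrid_similarity`, `rank_gallery`, and `weights` are hypothetical rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hybrid_similarity(triplet_query, language_query, caption_query,
                      target_image, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three cosine similarities against one target feature.

    Assumption: the weighted sum and the default weights are illustrative
    choices; the paper's actual aggregation rule may differ.
    """
    sims = (
        cosine(triplet_query, target_image),   # stand-in for triplet similarity
        cosine(language_query, target_image),  # stand-in for language similarity
        cosine(caption_query, target_image),   # stand-in for pseudo caption similarity
    )
    return sum(w * s for w, s in zip(weights, sims))


def rank_gallery(queries, gallery, weights=(1.0, 1.0, 1.0)):
    """Return gallery indices sorted from highest to lowest hybrid score."""
    scores = [hybrid_similarity(*queries, g, weights) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy 2-D features: all three query-side features point along the first axis.
queries = ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
gallery = [[0.0, 1.0], [1.0, 0.0]]
ranking = rank_gallery(queries, gallery)
```

In this toy setup the second gallery item matches all three query features, so it is ranked first. In a vision-fixed tuning regime as described above, only the text-side features feeding such a score would be updated during training, while the image features stay frozen.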

External IDs: dblp:journals/tgrs/YangLSWZ25