Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: A large body of research centers on Remote Sensing Image-Text Retrieval (RSITR), which aims to retrieve the corresponding targets for a given query. Within this work, transferring Foundation Models (FMs) such as CLIP to the remote sensing domain has shown promising results. However, existing FM-based approaches neglect the negative impact of weakly correlated sample pairs and the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning framework (EBAKER) for RSITR. Specifically, we devise an Eliminate Before Align (EBA) strategy that filters out weakly correlated sample pairs to mitigate their deviation from the optimal embedding space during alignment. Moreover, we introduce a Keyword Explicit Reasoning (KER) module to strengthen the positive role of subtle differences in key concepts. Without bells and whistles, our method achieves a one-step transformation from FM to the RSITR task, obviating the need for extra pretraining on remote sensing data. Extensive experiments on three popular benchmark datasets validate that the proposed EBAKER method outperforms state-of-the-art methods with less training data. Our source code will be released soon.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion, [Engagement] Multimedia Search and Recommendation
Relevance To Conference: Our work addresses Remote Sensing Image-Text Retrieval, focusing primarily on the transfer of multimodal foundation models to downstream tasks. We propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning (EBAKER) framework. Drawing on the unique characteristics of the remote sensing domain, the Eliminate Before Align strategy filters out weakly correlated sample pairs to mitigate their deviation from the optimal embedding space during alignment. Moreover, a Keyword Explicit Reasoning module is proposed to strengthen the positive role of subtle differences in key concepts. Ultimately, we transfer the foundation model (FM) directly to a remote sensing image-text retrieval model in a one-step training process, skipping the RS pretraining stage. As a result, the proposed EBAKER method achieves superior performance with less training data than state-of-the-art methods. In summary, our work offers a promising solution not only for directly transferring FMs to various tasks within the remote sensing domain, but also for extending the method to other domains such as product search and pedestrian retrieval, paving the way for future advancements.
Submission Number: 3187
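
Illustrative sketch (not the authors' implementation): the abstract describes filtering out weakly correlated image-text pairs before alignment but does not say how the filtering is done. The snippet below assumes one simple possibility, dropping pairs whose cosine similarity falls below a fixed threshold and then computing a standard image-to-text InfoNCE loss on the remaining pairs. The function names `eliminate_weak_pairs` and `info_nce`, the threshold value, the temperature, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def eliminate_weak_pairs(image_embs, text_embs, threshold):
    """Return indices of pairs whose image-text similarity is >= threshold.

    Assumption: a fixed similarity threshold stands in for the paper's
    (unspecified here) elimination criterion.
    """
    return [
        i
        for i, (img, txt) in enumerate(zip(image_embs, text_embs))
        if cosine(img, txt) >= threshold
    ]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE loss, averaged over the batch."""
    losses = []
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: the third pair is deliberately mismatched (orthogonal vectors).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]

kept = eliminate_weak_pairs(image_embs, text_embs, threshold=0.5)
loss_all = info_nce(image_embs, text_embs)
loss_kept = info_nce([image_embs[i] for i in kept], [text_embs[i] for i in kept])
print("kept pair indices:", kept)
print("loss on all pairs:", loss_all)
print("loss after elimination:", loss_kept)
```

In this toy batch the mismatched pair dominates the loss, so removing it before alignment keeps the gradient signal focused on well-matched pairs. A real system would presumably obtain the embeddings from a foundation model such as CLIP and might use a different elimination rule.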