Rethinking Remote Sensing CLIP: Leveraging Multimodal Large Language Models for High-Quality Vision-Language Dataset
Abstract: The application of Contrastive Language-Image Pre-training (CLIP) models to remote sensing imagery has garnered significant attention. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works have introduced extensive image-text datasets that leverage existing heterogeneous annotated remote sensing datasets and have trained vision-language foundation models on them. However, because these datasets rely on rudimentary methods for creating text descriptions, their quality is suboptimal: they require larger volumes of training data while yielding only modest performance improvements. In this paper, we propose employing Multimodal Large Language Models (MLLMs) to generate higher-quality captions. Specifically, we carefully design an Annotation to Instruction (A2I) module that bridges existing detection, segmentation, and classification annotations with the input requirements of grounding MLLMs. In addition, we propose a refined rule-based caption generation method and incorporate 8 classification datasets and 1 multispectral RGB composite image dataset to enhance the diversity of the data. Finally, we have created RSM-ITD, a high-quality, large-scale remote sensing image-text dataset containing approximately 480K image-text pairs. Experimental results show that, despite the smaller size of our proposed dataset, CLIP models trained on it outperform SOTA methods on tasks such as zero-shot classification, retrieval, and semantic localization. The dataset, pre-trained models, and code will be released upon publication.