Abstract: Remote sensing image (RSI) captioning is a vision-language multimodal task that aims to describe image content in natural language, facilitating accurate and convenient comprehension of RSIs. Existing methods primarily focus on extracting visual features with models pretrained on pure vision tasks, such as ResNet pretrained on ImageNet, which may not be optimal for a vision-language task. In addition, there has been limited emphasis on text preprocessing, leaving potential relationships among words within sentences unexplored. In this article, we propose a transformer-based model that utilizes CLIP visual grid features and a random masking strategy for the RSI captioning task. To enhance RSI representations, we use the visual encoder of CLIP, a vision-language pretraining model, to directly extract visual grid features from RSIs. All training sentences are then preprocessed with a random masking strategy to impart self-supervised text-learning capabilities to the model during the training stage. Extensive experiments on the RSICD, UCM-Captions, and Sydney-Captions datasets demonstrate the superior performance of our method.
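The two ingredients named in the abstract can be illustrated concretely. Below is a minimal sketch, not the authors' released code, of (a) extracting patch-level (grid) features from the CLIP vision encoder and (b) randomly masking caption tokens before training; the Hugging Face model name, the 15% mask ratio, and the `[MASK]` token are illustrative assumptions.

```python
# Sketch of CLIP grid-feature extraction and random caption masking.
# Assumptions: HuggingFace CLIP ViT-B/32 checkpoint, 15% mask ratio, "[MASK]" token.
import random
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_grid_features(image: Image.Image) -> torch.Tensor:
    """Return patch-level (grid) features from CLIP, dropping the [CLS] token."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = vision_encoder(**inputs)
    # last_hidden_state: (1, 1 + num_patches, hidden); index 0 is the [CLS] token
    return outputs.last_hidden_state[:, 1:, :]  # e.g. (1, 49, 768) for ViT-B/32

def random_mask(tokens: list[str], mask_ratio: float = 0.15,
                mask_token: str = "[MASK]") -> list[str]:
    """Replace each caption token with the mask token with probability mask_ratio."""
    return [mask_token if random.random() < mask_ratio else t for t in tokens]

caption = "many buildings and green trees are in a dense residential area".split()
print(random_mask(caption))
```

The masked captions would serve as the decoder input during training, so the model learns to recover the hidden words from context alongside the captioning objective.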