CLIP-Based Grid Features and Masking for Remote Sensing Image Captioning

Published: 01 Jan 2025 | Last Modified: 25 Jan 2025 | IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 2025 | License: CC BY-SA 4.0
Abstract: Remote sensing image (RSI) captioning is a vision-language multimodal task that aims to describe image content in natural language, facilitating accurate and convenient comprehension of RSIs. Existing methods primarily extract visual features with models pretrained on vision-only tasks, such as ResNet pretrained on ImageNet, which may not be optimal for vision-language tasks. In addition, text preprocessing has received limited attention, leaving potential relationships among words within sentences underexplored. In this article, we propose a transformer-based model that uses CLIP visual grid features and a random masking strategy for the RSI captioning task. To enhance RSI representations, we use the visual encoder of CLIP, a vision-language pretraining model, to extract visual grid features directly from RSIs. All training sentences are then preprocessed with a random masking strategy, imparting self-supervised text-learning capabilities to the model during training. Extensive experiments on the RSICD, UCM-Captions, and Sydney-Captions datasets demonstrate the superior performance of our method.
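The abstract describes two components that can be illustrated in code. First, a minimal sketch of extracting visual grid features from a CLIP vision encoder: the abstract does not name the CLIP variant or library, so the Hugging Face Transformers API, the `openai/clip-vit-base-patch32` checkpoint, and the file name `rsi_example.jpg` are assumptions for illustration. The key idea is to keep the per-patch token outputs (the "grid") rather than the pooled global embedding.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed checkpoint; the paper's exact CLIP variant is not specified in the abstract.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("rsi_example.jpg").convert("RGB")  # hypothetical RSI file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (1, 1 + num_patches, hidden); index 0 is the [CLS] token.
# Dropping it leaves the patch (grid) features: (1, 49, 768) for 224x224 / patch 32.
grid_features = outputs.last_hidden_state[:, 1:, :]
grid_7x7 = grid_features.reshape(1, 7, 7, -1)  # optional spatial grid layout
```

Second, a minimal sketch of a random masking strategy over tokenized training sentences, assuming BERT-style masking with a 15% ratio; the exact ratio, mask token, and replacement rules used in the paper are not given in the abstract, so `random_mask` and its parameters are purely illustrative.

```python
import random

def random_mask(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Replace each token with mask_token with probability mask_prob.

    Illustrative only: the paper's masking ratio and handling of masked
    positions (e.g., predicting the original token as a self-supervised
    target) are not specified in the abstract.
    """
    return [mask_token if random.random() < mask_prob else t for t in tokens]

# Example: mask a tokenized caption before feeding it to the captioning model.
caption = "many planes are parked at the airport".split()
masked = random_mask(caption)
```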