Abstract: Remote sensing image (RSI) captioning is a vision-language multimodal task that aims to describe image content in natural language, facilitating accurate and convenient comprehension of RSIs. Existing methods primarily focus on extracting visual features with models pretrained on pure vision tasks, such as ResNet pretrained on ImageNet, which may not be optimal for a vision-language task. In addition, there has been limited emphasis on text preprocessing, leaving potential relationships among words within sentences unexplored. In this article, we propose a transformer-based model that utilizes CLIP visual grid features and a random masking strategy for the RSI captioning task. To enhance RSI representations, we use the visual encoder of CLIP, a vision-language pretraining model, to directly extract visual grid features from RSIs. All training sentences are then preprocessed with a random masking strategy to impart self-supervised text-learning capabilities to the model during the training stage. Extensive experiments on the RSICD, UCM-Captions, and Sydney-Captions datasets demonstrate the superior performance of our method.
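The two ingredients named in the abstract can be illustrated concretely. Below is a minimal sketch, not the authors' released code, of (a) extracting patch-level (grid) features from the CLIP vision encoder and (b) randomly masking caption tokens before training; the Hugging Face model name, the 15% mask ratio, and the `[MASK]` token are illustrative assumptions.

```python
# Sketch of CLIP grid-feature extraction and random caption masking.
# Assumptions: HuggingFace CLIP ViT-B/32 checkpoint, 15% mask ratio, "[MASK]" token.
import random
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_grid_features(image: Image.Image) -> torch.Tensor:
    """Return patch-level (grid) features from CLIP, dropping the [CLS] token."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = vision_encoder(**inputs)
    # last_hidden_state: (1, 1 + num_patches, hidden); index 0 is the [CLS] token
    return outputs.last_hidden_state[:, 1:, :]  # e.g. (1, 49, 768) for ViT-B/32

def random_mask(tokens: list[str], mask_ratio: float = 0.15,
                mask_token: str = "[MASK]") -> list[str]:
    """Replace each caption token with the mask token with probability mask_ratio."""
    return [mask_token if random.random() < mask_ratio else t for t in tokens]

caption = "many buildings and green trees are in a dense residential area".split()
print(random_mask(caption))
```

The masked captions would serve as the decoder input during training, so the model learns to recover the hidden words from context alongside the captioning objective.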