PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery

Jiaojiao Li, Penghao Tian, Rui Song, Haitao Xu, Yunsong Li, Qian Du

Published: 2024, Last Modified: 13 Nov 2024. IEEE Trans. Geosci. Remote. Sens. 2024. License: CC BY-SA 4.0

Abstract: Remote-sensing object detection (RSOD) is a fundamental and valuable task in Earth monitoring. However, remote-sensing images (RSIs) are typically acquired from a bird’s-eye perspective, which gives them intrinsic properties such as complex backgrounds, randomly and densely distributed objects, and objects at multiple scales. These properties hinder the direct application of detection methods that perform well on natural images (NIs) to RSIs and limit the performance they achieve. To address this, we propose a pyramid convolutional vision transformer (PCViT) that overcomes the limitations of existing transformer methods. First, we employ a pyramid architecture to capture the multiscale information present in RSIs. To enhance the feature extraction capability of the transformer, we introduce a parallel convolution module (PCM) that supplies local information the transformer may miss. Furthermore, we propose a self-supervised pretraining strategy, multiperspective pretraining (MPP), to pretrain the model before finetuning it on the downstream detection task. During the finetuning stage, we introduce a local/global $k$-NN attention (LGKA) to improve how relationships between tokens are established. In the neck, we propose a feature-reflowing pyramid network (FRPN) to facilitate contextual information interaction and further enhance PCViT’s ability to process multiscale information. Experimental results on two representative datasets, NWPU VHR-10 and DIOR, demonstrate the effectiveness of PCViT. These results highlight the suitability of PCViT for RSOD tasks.
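
The abstract names a local/global $k$-NN attention (LGKA) but does not define it. As a rough illustration of the general idea behind $k$-NN attention only, the sketch below implements generic single-head scaled dot-product attention in which each query keeps its `top_k` highest-scoring keys and ignores the rest. It is not the authors' LGKA: the local/global split is not modeled, and the function name `knn_attention` and all of its parameters are hypothetical.

```python
import math

def knn_attention(queries, keys, values, top_k):
    """Generic k-NN attention (assumed form, not the paper's LGKA):
    each query attends only to its top_k highest-scoring keys."""
    dim = len(queries[0])
    top_k = min(top_k, len(keys))
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Indices of the top_k largest scores; every other key is dropped.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:top_k]
        # Softmax over the kept scores only (max subtracted for stability).
        peak = max(scores[j] for j in keep)
        exps = {j: math.exp(scores[j] - peak) for j in keep}
        total = sum(exps.values())
        weights = [exps.get(j, 0.0) / total for j in range(len(keys))]
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

With `top_k` equal to the number of keys this reduces to ordinary softmax attention; smaller values discard low-scoring tokens, which is one plausible way such a mechanism could suppress background clutter in RSIs.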