Abstract: Although convolutional neural network (CNN)-based and vision transformer (ViT)-based methods have achieved effective remote sensing scene classification results in the past few years, the CNN's inductive bias and the ViT's single spatial scale limit further improvements in accuracy. To address these problems, in this paper we propose a novel multi-scale vision transformer (MS-ViT) for remote sensing scene image classification, which consists of two information streams at different spatial scales that extract multi-scale spatial information, a hybrid attention module that fuses this information, and a classifier that predicts the scene category. We evaluated the effectiveness of our proposed method on two remote sensing datasets, NWPU-RESISC45 and AID. The experimental results show that our method outperforms CNN-based methods and the original ViT-based method.
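The abstract only outlines the architecture (two spatial-scale streams, a hybrid attention fusion module, and a classifier), so the following is a minimal PyTorch sketch of that layout, not the authors' implementation. The patch sizes (16 and 32), the cross-attention-plus-self-attention fusion, the embedding width, and the encoder depth are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to embed_dim."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, N, D)


class HybridAttentionFusion(nn.Module):
    """Hypothetical fusion: cross-attention from the coarse stream onto the fine
    stream, followed by self-attention over the concatenated tokens."""
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, fine, coarse):
        fused, _ = self.cross_attn(query=coarse, key=fine, value=fine)
        tokens = torch.cat([fused, fine], dim=1)
        out, _ = self.self_attn(tokens, tokens, tokens)
        return self.norm(out)


class MSViT(nn.Module):
    """Two patch-scale streams -> transformer encoders -> hybrid attention fusion -> classifier.
    Positional embeddings and class tokens are omitted for brevity."""
    def __init__(self, num_classes=45, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        self.embed_fine = PatchEmbed(patch_size=16, embed_dim=embed_dim)    # fine spatial scale
        self.embed_coarse = PatchEmbed(patch_size=32, embed_dim=embed_dim)  # coarse spatial scale
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=embed_dim * 4, batch_first=True)
        self.encoder_fine = nn.TransformerEncoder(layer, depth)
        self.encoder_coarse = nn.TransformerEncoder(layer, depth)
        self.fusion = HybridAttentionFusion(embed_dim, num_heads)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        fine = self.encoder_fine(self.embed_fine(x))
        coarse = self.encoder_coarse(self.embed_coarse(x))
        fused = self.fusion(fine, coarse)
        return self.head(fused.mean(dim=1))  # global average pool, then classify


logits = MSViT(num_classes=45)(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 45])
```

The number of classes (45) matches NWPU-RESISC45; for AID it would be 30.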