Abstract: The Vision Transformer (ViT) has achieved great success in computer vision since its introduction, and many works have applied ViT-based models to remote sensing scene classification (RSSC). The Pyramid Vision Transformer (PVT) greatly reduces the computational cost of ViT while maintaining accuracy. However, PVT does not exploit the multi-scale information in remote sensing (RS) scenes, which is crucial for RSSC. This paper proposes a multi-scale sparse transformer (MST) built on PVT. MST enables the network to learn multi-scale representations of RS scenes by performing spatial reduction at different scales. In addition, we employ sparse operations to adaptively guide the model's attention toward semantically relevant regions during self-attention computation, thereby reducing interference from semantically irrelevant areas. Experiments on the UCM and AID datasets demonstrate the outstanding performance of the proposed MST.
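To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of an attention block that combines PVT-style spatial reduction at several scales with a top-k sparsification of the attention map. The module name, the reduction ratios, and the top-k threshold are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultiScaleSparseAttention(nn.Module):
    """Illustrative sketch (not the paper's exact design), assuming:
    - keys/values are spatially reduced at several ratios, as in PVT's
      spatial-reduction attention (SRA), then concatenated;
    - sparsity keeps only the top-k attention scores per query."""
    def __init__(self, dim, num_heads=4, sr_ratios=(8, 4, 2), top_k=16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.top_k = top_k
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # One strided conv per scale to shrink the key/value feature map.
        self.srs = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=r, stride=r) for r in sr_ratios
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in sr_ratios)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W; H, W divisible by each sr ratio
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Build a multi-scale key/value sequence by concatenating tokens
        # from each spatially reduced copy of the feature map.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        kv_tokens = []
        for sr, norm in zip(self.srs, self.norms):
            t = sr(feat).flatten(2).transpose(1, 2)  # (B, N_r, C)
            kv_tokens.append(norm(t))
        kv = self.kv(torch.cat(kv_tokens, dim=1))
        k, v = kv.chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, M)

        # Sparse step: keep only the top-k scores per query and mask the
        # rest, so attention concentrates on the most relevant positions.
        k_eff = min(self.top_k, attn.shape[-1])
        thresh = attn.topk(k_eff, dim=-1).values[..., -1:]
        attn = attn.masked_fill(attn < thresh, float('-inf'))
        attn = attn.softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage on a hypothetical 16x16 token grid with 64-dim embeddings:
x = torch.randn(2, 16 * 16, 64)
block = MultiScaleSparseAttention(dim=64)
y = block(x, H=16, W=16)  # (2, 256, 64)
```

Concatenating reductions at ratios 8, 4, and 2 gives the queries access to coarse and fine context simultaneously, while the top-k mask discards low-scoring (semantically irrelevant) key positions before the softmax.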