Multi-Scale Sparse Transformer for Remote Sensing Scene Classification

Published: 01 Jan 2024 · Last Modified: 11 May 2025 · IGARSS 2024 · CC BY-SA 4.0
Abstract: The Vision Transformer (ViT) has achieved great success in computer vision since its introduction, and many works have applied ViT-based models to remote sensing scene classification (RSSC). The Pyramid Vision Transformer (PVT) greatly reduces the computational cost of ViT while maintaining accuracy. However, PVT does not exploit the multi-scale information in remote sensing (RS) scenes, which is crucial for RSSC. This paper proposes a multi-scale sparse transformer (MST) built on PVT. MST enables the network to learn multi-scale representations of RS scenes by performing spatial reduction at different scales. In addition, we employ sparse operations to adaptively guide the model's attention towards semantically relevant regions during self-attention computation, thereby reducing interference from semantically irrelevant areas. Experiments on the UCM and AID datasets demonstrate the outstanding performance of the proposed MST.
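The abstract does not specify how the multi-scale spatial reduction or the sparse operation is implemented, so the following is only an illustrative NumPy sketch under stated assumptions: spatial reduction is taken to be average pooling of the key/value token map (as in PVT), and the sparse operation is taken to be top-k masking of attention scores. All function names, the reduction ratios, and `top_k` are hypothetical choices for illustration, not the authors' method.

```python
import numpy as np

def spatial_reduce(x, h, w, ratio):
    """Average-pool a flattened (h*w, c) token map by `ratio` along each spatial axis."""
    n, c = x.shape
    x = x.reshape(h, w, c)
    hh, ww = h // ratio, w // ratio
    # group pixels into ratio x ratio blocks and average within each block
    x = x[:hh * ratio, :ww * ratio].reshape(hh, ratio, ww, ratio, c).mean(axis=(1, 3))
    return x.reshape(hh * ww, c)

def sparse_attention(q, kv, top_k):
    """Scaled dot-product attention that keeps only the top_k keys per query."""
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)                      # (Nq, Nk)
    # mask all but the top_k scores per query (the "sparse operation")
    thresh = np.sort(scores, axis=-1)[:, -top_k][:, None]
    scores = np.where(scores >= thresh, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

def multi_scale_sparse_attention(x, h, w, ratios=(1, 2, 4), top_k=4):
    """Attend to keys/values pooled at several reduction ratios, then average the outputs."""
    outs = [sparse_attention(x, spatial_reduce(x, h, w, r), top_k) for r in ratios]
    return np.mean(outs, axis=0)

# usage: 8x8 token map with 32-dim embeddings
tokens = np.random.default_rng(0).standard_normal((64, 32))
out = multi_scale_sparse_attention(tokens, 8, 8)
```

In this sketch each query attends only to its `top_k` strongest keys at every scale, which mimics how a sparse attention mechanism can suppress semantically irrelevant regions, while the pooled key/value maps at ratios 1, 2, and 4 give the attention access to coarse and fine context simultaneously.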
