Swin-MSP: A Shifted Windows Masked Spectral Pretraining Model for Hyperspectral Image Classification
Abstract: Deep learning has found widespread application in hyperspectral image (HSI) classification, where transformer architectures based on self-attention have emerged as the state of the art (SOTA). The Swin-MAE framework uses a masked autoencoder approach with a shifted-windows transformer as its backbone, demonstrating strong representational power and performance. This study proposes a shifted windows masked spectral pretraining (Swin-MSP) model, which achieves hierarchical modeling of hyperspectral data from local to global scales by introducing a spectral masking pretraining technique and a hierarchical architecture. To support this pretraining, we introduce the uniaxial continuous cross-correlation layer (UC3L), a straightforward yet effective solution tailored to masking hyperspectral imagery. We design the shift frequency band transformer (SFBT) to characterize spectral features hierarchically. Experiments on publicly available datasets establish that our pretrained network significantly improves classification efficiency compared with SOTA networks. Furthermore, we systematically investigate the sensitivity of various datasets to the pretraining hyperparameters. The results underscore that the universal spectral representation acquired during the pretraining phase serves as a robust initialization for subsequent task-specific fine-tuning. Notably, this work breaks from traditional vision transformer (ViT) approaches, offering a new perspective on hyperspectral dataset pretraining. The code is available at https://github.com/teaRRe/Swin-MSP .
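The central idea of the pretraining stage, masking spectral bands of each pixel vector and training an encoder-decoder to reconstruct them, can be illustrated with a minimal sketch. The class name, layer sizes, mask ratio, and the use of a plain 1-D convolution as the band embedding (a stand-in for the paper's UC3L) are illustrative assumptions here, not the authors' implementation; see the repository linked above for the actual code.

```python
import torch
import torch.nn as nn

class SpectralMaskingPretrainer(nn.Module):
    """Conceptual sketch of masked spectral pretraining: randomly mask
    spectral bands of each pixel spectrum and reconstruct the missing bands."""

    def __init__(self, num_bands=200, embed_dim=64, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # 1-D convolution along the spectral axis as a simple band embedding
        # (illustrative stand-in for the UC3L described in the paper)
        self.embed = nn.Conv1d(1, embed_dim, kernel_size=7, padding=3)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.Linear(embed_dim, 1)  # reconstruct band reflectance
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, spectra):                       # spectra: (B, num_bands)
        B, L = spectra.shape
        tokens = self.embed(spectra.unsqueeze(1)).transpose(1, 2)   # (B, L, D)
        # Randomly select bands to mask and replace them with a learned token
        mask = torch.rand(B, L, device=spectra.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, L, -1), tokens)
        recon = self.decoder(self.encoder(tokens)).squeeze(-1)      # (B, L)
        # Reconstruction loss is computed only on the masked bands
        return ((recon - spectra) ** 2)[mask].mean()

# Usage: pretrain on unlabeled pixel spectra, then reuse the encoder for fine-tuning
model = SpectralMaskingPretrainer(num_bands=200)
loss = model(torch.rand(8, 200))   # batch of 8 pixel spectra
loss.backward()
```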