S²RCFormer: Spatial-Spectral Residual Cross-Attention Transformer for Multimodal Remote Sensing Data Classification
Abstract: With the advancement of remote sensing technology, more and more modalities are becoming available for land cover classification, helping to address the insufficient and incomplete information that results from modeling single-source remote sensing images. However, most existing remote sensing data classifiers struggle to capture reliable and informative spatial and spectral dependencies and neglect the correlations and complementarity between different modalities. To address these challenges, we propose the spatial-spectral residual cross-attention transformer (S²RCFormer) for multimodal remote sensing data classification (MRSDC). It mainly consists of a patchwise convolutional module (PTConv), a pixelwise convolutional module (PXConv), a residual cross-attention tokenization module (RCTM), and a transformer feature fusion module (TFFM). To make full use of multimodal cues, PTConv extracts patchwise spatial-spectral features from the hyperspectral image (HSI) and spatial features from the other modality, respectively, while PXConv exploits detailed spectral features from the HSI and distinctive pixelwise features from the other modality. Afterward, RCTM employs 1-D residual cross-attention to adaptively fuse diverse features across different modalities in a tokenization fashion. Finally, TFFM establishes long-range dependencies over the set of heterogeneous tokens, followed by a multilayer perceptron (MLP) that predicts the output category. To verify the effectiveness of the proposed method, extensive experiments are conducted on three benchmark datasets (Trento, MUUFL, Augsburg) using four different modality combinations. The results indicate that the proposed approach is comparable to other state-of-the-art methods across different metrics.
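The abstract does not give the exact formulation of RCTM's 1-D residual cross-attention, but the general mechanism it names is standard: tokens from one modality supply the queries, tokens from the other supply the keys and values, and a residual connection preserves the query stream. Below is a minimal NumPy sketch of that generic mechanism; the function name, single-head form, and weight shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_cross_attention(x_tokens, y_tokens, wq, wk, wv):
    """Generic single-head residual cross-attention (illustrative, not the
    paper's RCTM): x_tokens (n, d) attend over y_tokens (m, d), with a
    residual connection back onto the query stream."""
    q = x_tokens @ wq                      # queries from modality X
    k = y_tokens @ wk                      # keys from modality Y
    v = y_tokens @ wv                      # values from modality Y
    d_k = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (n, m) weights
    return x_tokens + attn @ v             # residual fusion of Y into X
```

The residual term keeps each modality's own tokens intact even when the cross-modal attention contributes little, which is one common motivation for residual fusion designs.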
External IDs: dblp:journals/staeors/XuCLLLZWRD25