RsMmFormer: Multimodal Transformer Using Multiscale Self-attention for Remote Sensing Image Classification
Abstract: Remote Sensing (RS) has been widely utilized in various Earth Observation (EO) missions, including land cover classification and environmental monitoring. Unlike computer vision tasks on natural images, collecting remote sensing data is more challenging. To fully exploit the available data and leverage the complementary information across different data sources, we propose a novel approach called Multimodal Transformer for Remote Sensing (RsMmFormer) for image classification, which utilizes both Hyperspectral Image (HSI) and Light Detection and Ranging (LiDAR) data. In contrast to the conventional Vision Transformer (ViT), which lacks the inductive biases inherent to convolutions, we improve our RsMmFormer model by incorporating convolutional layers, allowing it to retain the favorable characteristics of convolutional neural networks (CNNs). We further introduce the Multi-scale Multi-head Self-Attention (MSMHSA) module, which learns fine-grained representations and thereby facilitates the detection of small targets occupying only a few pixels in a remote sensing image. The proposed MSMHSA module integrates HSI and LiDAR data in a progressive, coarse-to-fine manner, effectively attending to both global and local contexts through self-attention. Comprehensive experiments on popular benchmarks such as Trento and MUUFL demonstrate the effectiveness and superiority of the proposed RsMmFormer model for remote sensing image classification.
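To make the multi-scale attention idea concrete, the following is a minimal sketch (not the authors' implementation) of a multi-scale multi-head self-attention block applied to fused HSI and LiDAR tokens, written in PyTorch. The window sizes, embedding dimension, and the concatenation-based fusion and merge steps are illustrative assumptions only; the paper's exact MSMHSA design may differ.

```python
# Hypothetical multi-scale multi-head self-attention sketch (assumptions noted above).
import torch
import torch.nn as nn


class MultiScaleSelfAttention(nn.Module):
    """Runs multi-head self-attention over windows of several sizes and merges
    the per-scale outputs, so small windows capture local detail (e.g. small
    targets) while large windows capture more global context."""

    def __init__(self, dim=64, num_heads=4, window_sizes=(4, 8, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        # One attention block per scale (illustrative design choice).
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        # Project the concatenated per-scale outputs back to `dim`.
        self.merge = nn.Linear(dim * len(window_sizes), dim)

    def forward(self, tokens):                      # tokens: (B, N, C)
        outputs = []
        for win, attn in zip(self.window_sizes, self.attns):
            b, n, c = tokens.shape
            pad = (win - n % win) % win             # pad so N divides by win
            x = nn.functional.pad(tokens, (0, 0, 0, pad))
            x = x.reshape(b * (x.shape[1] // win), win, c)
            y, _ = attn(x, x, x)                    # self-attention per window
            y = y.reshape(b, -1, c)[:, :n]          # drop padding tokens
            outputs.append(y)
        return self.merge(torch.cat(outputs, dim=-1))


if __name__ == "__main__":
    # Toy fused token sequence: HSI and LiDAR patch embeddings concatenated
    # along the token axis (illustrative; the paper's fusion may differ).
    hsi_tokens = torch.randn(2, 100, 64)
    lidar_tokens = torch.randn(2, 100, 64)
    fused = torch.cat([hsi_tokens, lidar_tokens], dim=1)
    out = MultiScaleSelfAttention()(fused)
    print(out.shape)                                # torch.Size([2, 200, 64])
```

In this sketch, each scale attends within fixed-size windows and the results are concatenated and linearly merged; the progressive coarse-to-fine fusion of HSI and LiDAR described in the abstract would sit on top of such a block.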