Abstract: Multisource remote sensing data has gained significant attention in land use classification. However, effectively extracting both local and global features from different modalities and fusing them to leverage their complementary information remains a substantial challenge. In this paper, we address this challenge by using transformers for simultaneous local and global feature extraction while enabling cross-modality learning, which improves the integration of complementary information from hyperspectral imagery (HSI) and LiDAR data. We propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal convolutional transformer, which extracts both local and global features using a multi-scale convolutional embedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local and global features. Additionally, cros
External IDs: dblp:conf/visigrapp/RehmanIUBJ25