Towards Robust Multimodal Land Use Classification: A Convolutional Embedded Transformer

Published: 01 Jan 2025, Last Modified: 06 Nov 2025 · VISIGRAPP (3): VISAPP 2025 · CC BY-SA 4.0
Abstract: Multisource remote sensing data have gained significant attention in land use classification. However, effectively extracting both local and global features from the various modalities, and fusing them to leverage their complementary information, remains a substantial challenge. In this paper, we address this challenge by exploring the use of transformers for simultaneous local and global feature extraction, while enabling cross-modality learning to improve the integration of complementary information from hyperspectral imaging (HSI) and LiDAR data modalities. We propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal convolutional transformer, which extracts both local and global features using a multi-scale convolutional embedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local and global features. Additionally, cros
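The abstract does not give implementation details for the MSCE, but the idea of a multi-scale convolutional embedding — producing transformer tokens from convolutions at several kernel sizes so that local context is baked into each token — can be sketched as follows. This is a minimal illustration under assumed shapes and hypothetical function names (`conv2d_same`, `multiscale_embed` are not from the paper), not the authors' implementation:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 2-D convolution with 'same' zero padding.
    x: (H, W, Cin), w: (k, k, Cin, Cout) -> (H, W, Cout)."""
    H, W, _ = x.shape
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]           # (k, k, Cin) receptive field
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

def multiscale_embed(x, weights):
    """Concatenate convolutional embeddings at several kernel sizes,
    then flatten spatial positions into a token sequence for a transformer."""
    feats = [conv2d_same(x, w) for w in weights]      # each (H, W, d)
    fused = np.concatenate(feats, axis=-1)            # (H, W, d * n_scales)
    return fused.reshape(-1, fused.shape[-1])         # (H*W, d * n_scales) tokens

rng = np.random.default_rng(0)
hsi = rng.standard_normal((8, 8, 16))                 # toy HSI patch: 8x8 pixels, 16 bands
ws = [rng.standard_normal((k, k, 16, 8)) * 0.1 for k in (1, 3, 5)]  # three scales
tokens = multiscale_embed(hsi, ws)
print(tokens.shape)  # (64, 24): 64 spatial tokens, 8 channels per scale x 3 scales
```

Each spatial position becomes one token whose channels mix receptive fields of size 1, 3, and 5, which is one plausible way convolutional layers embedded in an encoder could blend local detail with the global context the attention layers later provide.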