Aerial Image Semantic Segmentation Method Based on Cross-Modal Hierarchical Feature Fusion

Jinglei Bai, Jinfu Yang, Tao Xiang, Shu Cai

Published: 2025, Last Modified: 28 Oct 2025. IEEE Geosci. Remote. Sens. Lett. 2025. License: CC BY-SA 4.0

Abstract: Multimodal aerial image semantic segmentation enables fine-grained land cover classification by integrating data from different sensors, yet it remains challenged by information redundancy, intermodal feature discrepancies, and class confusion in complex scenes. To address these issues, we propose a cross-modal hierarchical feature fusion network (CMHFNet) based on an encoder–decoder architecture. The encoder incorporates a pixelwise attention-guided fusion module (PAFM) to suppress redundancy and a multistage progressive fusion transformer (MPFT) to model long-range intermodal dependencies and scale variations. The decoder introduces a residual information-guided feature compensation mechanism to recover spatial details and mitigate class ambiguity. Experiments on the DDOS, Vaihingen, and Potsdam datasets demonstrate that CMHFNet surpasses state-of-the-art methods, validating its effectiveness and practical value.
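The abstract does not specify how the PAFM computes its attention, so the following is only a minimal sketch of the general idea of pixelwise gated fusion of two modalities, not the authors' module. It assumes two single-channel feature maps of equal shape and a per-pixel sigmoid gate. The function name `pixelwise_gated_fusion` and the parameters `w_a`, `w_b`, and `bias` are hypothetical stand-ins for weights that a real network would learn.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def pixelwise_gated_fusion(feat_a, feat_b, w_a=1.0, w_b=1.0, bias=0.0):
    """Fuse two single-channel feature maps (lists of rows) pixel by pixel.

    Illustrative only: w_a, w_b, and bias are hypothetical fixed values
    standing in for learned parameters. This is NOT the PAFM from the paper.
    """
    same_rows = len(feat_a) == len(feat_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(feat_a, feat_b)):
        raise ValueError("feature maps must have the same shape")

    fused = []
    for row_a, row_b in zip(feat_a, feat_b):
        fused_row = []
        for a, b in zip(row_a, row_b):
            # The gate is computed separately at every pixel from both inputs.
            gate = sigmoid(w_a * a + w_b * b + bias)
            # Convex combination: the fused value lies between a and b.
            fused_row.append(gate * a + (1.0 - gate) * b)
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    modality_a = [[0.0, 4.0], [1.0, -2.0]]
    modality_b = [[2.0, 0.0], [1.0, 3.0]]
    print(pixelwise_gated_fusion(modality_a, modality_b))
```

Because the gate weights the two inputs so that they sum to one at each pixel, a modality that the gate scores low at a given location contributes little there, which is one simple way to limit redundant information during fusion.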

External IDs: dblp:journals/lgrs/BaiYXC25