EMSNet: Efficient Multimodal Symmetric Network for Semantic Segmentation of Urban Scene From Remote Sensing Imagery
Abstract: High-resolution remote sensing imagery (RSI) plays a pivotal role in the semantic segmentation (SS) of urban scenes, particularly in urban management tasks such as building planning and traffic flow analysis. However, the dense distribution of objects and the prevalent background noise in RSI make it challenging to achieve stable and accurate results from a single view. Integrating digital surface models (DSM) can yield high-precision SS, but doing so often requires extensive computational resources, so the tradeoff between accuracy and computational cost must be addressed before such methods can be deployed on edge devices. In this article, we introduce an efficient multimodal symmetric network (EMSNet) designed to perform SS by leveraging both optical and DSM images. Unlike other multimodal methods, EMSNet adopts a dual encoder–decoder structure that builds a direct connection between the DSM data and the final result, making full use of the height information in the DSM. Between the branches, we propose a continuous feature interaction that uses RGB features to guide the DSM branch. Within each branch, multilevel feature fusion captures both low-level spatial and high-level semantic information, improving the model's scene perception. Meanwhile, knowledge distillation (KD) further improves the performance and generalization of EMSNet. Experiments on the Potsdam and Vaihingen datasets demonstrate the superiority of our method over other baseline models, and ablation experiments validate the effectiveness of each component. In addition, the KD strategy is validated by comparison with the Segment Anything Model (SAM): it enables the proposed multimodal SS network to match SAM's performance with only one-fifth of the parameters, computation, and latency.
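The abstract describes a symmetric dual encoder–decoder in which RGB features guide a parallel DSM branch and multilevel features are fused in each decoder. The following is a minimal, framework-free sketch of that data flow only; the pooling encoder, nearest-neighbor upsampling, additive interaction weight, and summation-based fusion are all illustrative assumptions, not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, levels=3):
    """Toy encoder: 2x2 average pooling at each level (stands in for conv stages)."""
    feats = []
    for _ in range(levels):
        h, w, c = x.shape
        x = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
        feats.append(x)
    return feats

def upsample(x, size):
    """Nearest-neighbor upsampling to a target spatial size."""
    h, w, _ = x.shape
    rows = np.repeat(np.arange(h), size[0] // h)
    cols = np.repeat(np.arange(w), size[1] // w)
    return x[rows][:, cols]

# Two modalities: optical RGB (3 channels) and DSM height (tiled to 3 channels
# here purely so the toy arrays line up).
rgb = rng.random((64, 64, 3))
dsm = np.repeat(rng.random((64, 64, 1)), 3, axis=2)

rgb_feats = encode(rgb)
# Continuous feature interaction (assumed additive): RGB features guide the
# DSM branch level by level.
dsm_feats = [d + 0.5 * r for d, r in zip(encode(dsm), rgb_feats)]

def decode(feats, out_size=(64, 64)):
    """Toy multilevel fusion: upsample every level to full resolution and sum."""
    return sum(upsample(f, out_size) for f in feats)

# Symmetric structure: each branch has its own decoder, so the DSM branch keeps
# a direct path to the final prediction; the two outputs are then merged.
seg_logits = decode(rgb_feats) + decode(dsm_feats)
print(seg_logits.shape)  # (64, 64, 3)
```

The direct DSM-to-output path is the point of the symmetry: height cues reach the prediction without first being absorbed into the RGB stream.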
External IDs: dblp:journals/staeors/ZhouWSWZZ25