EMRA-proxy: Enhancing Multi-Class Region Semantic Segmentation in Remote Sensing Images with Attention Proxy

Yichun Yu, Yuqing Lan, Zhihuan Xing, Xiaoyi Yang, Tingyue Tang, Dan Yu

Published: 19 Aug 2024, Last Modified: 12 Jun 2025. Proceedings of the 20th International Conference on Intelligent Computing (ICIC 2024): Poster Volume I. Tianjin, China. License: CC BY-ND 4.0

Abstract: Semantic segmentation of high-resolution remote sensing (HRRS) images is highly challenging due to the complex spatial layouts and significant appearance variations of multi-class objects. Convolutional Neural Networks (CNNs) have been widely employed as feature extractors for various visual tasks, owing to their excellent ability to extract local features. However, due to the inherent local bias of convolutional operations, CNNs are limited in modeling long-range dependencies. Transformers, on the other hand, excel at capturing global representations but overlook the details of local features and category features, and they incur high computational and memory complexity when dealing with high-resolution feature maps. Semantic segmentation has traditionally been modeled as predicting a label for each point on a dense regular grid. In this work, we propose a novel and effective model, EMRA-proxy, which consists of two parts: a homogeneous regions attention proxy (HRA-proxy) and a multi-class attention proxy (MCA-proxy). The proposed EMRA-proxy model abandons the common Cartesian feature layout and operates purely at the region level. First, to capture contextual information within a region, we use a Transformer to encode regions in a sequence-to-sequence manner, applying multiple layers of self-attention to region embeddings that act as proxies for specific regions. HRA-proxy then interprets the image as a set of learnable surface subdivisions, each with flexible geometry and homogeneous semantics. Prediction is made per region by a single linear classifier on top of the encoded region embeddings, yielding a homogeneous semantic mask feature map (HSMF-map). MCA-proxy then learns a global class attention map (GCA-map) to compensate for the Vision Transformer's (ViT's) shortcomings in multi-class information extraction. Finally, the HSMF-map and GCA-map are integrated to achieve high-precision multi-class remote sensing image segmentation. Extensive experiments on three public remote sensing datasets demonstrate that EMRA-proxy outperforms state-of-the-art methods in overall performance.
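
The region-level pipeline described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract does not specify how regions are formed, how the GCA-map is computed, or how the two maps are integrated, so the soft assignment matrix `assign`, the `class_tokens`, the elementwise-product fusion, and every function name below (`init_params`, `self_attention`, `segment_regions`) are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, dim, num_classes, num_layers=2):
    # Hypothetical random parameters; a real model would learn these.
    scale = 1.0 / np.sqrt(dim)
    layers = [tuple(rng.normal(size=(dim, dim)) * scale for _ in range(3))
              for _ in range(num_layers)]
    return {
        "layers": layers,                                        # (Wq, Wk, Wv) per layer
        "w_cls": rng.normal(size=(dim, num_classes)) * scale,    # linear classifier weights
        "b_cls": np.zeros(num_classes),                          # linear classifier bias
        "class_tokens": rng.normal(size=(num_classes, dim)),     # assumed class queries
    }

def self_attention(x, wq, wk, wv):
    # Single-head self-attention over region embeddings, with a residual connection.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return x + softmax(scores, axis=-1) @ v

def segment_regions(pixel_feats, assign, params):
    # pixel_feats: (N, D) per-pixel features; assign: (N, R) soft pixel-to-region
    # assignment whose rows sum to 1 (assumed to be given here).
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    region_emb = weights.T @ pixel_feats                 # (R, D) region proxies

    # Encode regions as a sequence with stacked self-attention layers.
    for wq, wk, wv in params["layers"]:
        region_emb = self_attention(region_emb, wq, wk, wv)

    # One linear classifier per region, then scatter predictions back to pixels.
    region_probs = softmax(region_emb @ params["w_cls"] + params["b_cls"])  # (R, C)
    region_mask = assign @ region_probs                  # (N, C), HSMF-map analogue

    # Assumed class attention: each pixel attends over class tokens.
    dim = pixel_feats.shape[-1]
    class_attn = softmax(pixel_feats @ params["class_tokens"].T / np.sqrt(dim))  # (N, C)

    # Assumed fusion: elementwise product, renormalized per pixel.
    fused = region_mask * class_attn
    return fused / fused.sum(axis=-1, keepdims=True)
```

In this sketch self-attention runs over R region embeddings rather than N pixels, so the attention matrix has R × R entries instead of N × N, which is the kind of saving that operating at the region level is meant to provide on high-resolution inputs.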