EMRA-proxy: Enhancing Multi-Class Region Semantic Segmentation in Remote Sensing Images with Attention Proxy
Abstract: Semantic segmentation is a highly challenging task in high-resolution
remote sensing (HRRS) images due to the complex spatial layouts and
significant appearance variations of multi-class objects. Convolutional Neural
Networks (CNNs) have been widely employed as feature extractors for various
visual tasks, owing to their excellent ability to extract local features. However,
due to the inherent bias of convolutional operations, CNNs inevitably have
limitations in modeling long-range dependencies. On the other hand,
Transformers excel in capturing global representations but unfortunately
overlook the details of local features and category features, and exhibit high
computational and spatial complexity when dealing with high-resolution feature
maps. Semantic segmentation has traditionally been modeled as predicting each
point on a dense regular grid. In this work, we propose a novel and effective
model, EMRA-proxy, which consists of two parts: a homogeneous regions
attention proxy (HRA-proxy) and a multi-class attention proxy (MCA-proxy).
The proposed EMRA-proxy model abandons the common Cartesian feature
layout and operates purely at the region level. First, to capture contextual
information within a region, we use a Transformer to encode regions in a
sequence-to-sequence manner by applying multiple layers of self-attention to
region embeddings acting as proxies for specific regions. HRA-proxy then
interprets the image into learnable surface subdivisions, each with flexible
geometry and homogeneous semantics. This is done by applying a single linear
classifier on top of the encoded region embeddings to predict a class per region,
thereby obtaining a homogeneous semantic mask feature map (HSMF-map).
Then MCA-proxy learns the global class attention map (GCA-map) to make up
for ViT's shortcomings in multi-class information extraction. Finally, the
HSMF-map and GCA-map are integrated to achieve high-precision multi-class remote
sensing image segmentation. Extensive experiments on three public remote
sensing datasets demonstrate that EMRA-proxy outperforms state-of-the-art
methods in overall performance.
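
The region-level pipeline described in the abstract (self-attention over region embeddings, then a single linear classifier per region, then a per-pixel mask) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the single attention layer, and the hard pixel-to-region assignment are all assumptions made for illustration, and the MCA-proxy / GCA-map branch is omitted because the abstract does not specify how the two maps are integrated.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(regions):
    """One single-head self-attention layer over region embeddings.

    Assumption: identity Q/K/V projections, so each output is a
    softmax-weighted average of all region embeddings.
    """
    d = len(regions[0])
    encoded = []
    for query in regions:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in regions
        ]
        weights = softmax(scores)
        encoded.append(
            [sum(w * value[j] for w, value in zip(weights, regions))
             for j in range(d)]
        )
    return encoded


def classify_regions(encoded, weights, bias):
    """Single linear classifier: one predicted class index per region."""
    labels = []
    for emb in encoded:
        logits = [
            sum(w * x for w, x in zip(row, emb)) + b
            for row, b in zip(weights, bias)
        ]
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels


def paint_mask(assignment, labels):
    """Give every pixel the class predicted for the region it belongs to."""
    return [[labels[region] for region in row] for row in assignment]


# Toy input (hypothetical): three 2-D region embeddings, two classes,
# and a 2x3 image whose pixels are hard-assigned to regions 0..2.
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
assignment = [[0, 0, 2], [1, 2, 2]]

encoded = self_attention(regions)
labels = classify_regions(encoded, weights, bias)
mask = paint_mask(assignment, labels)
```

Because prediction happens once per region rather than once per grid cell, the attention cost in this sketch scales with the number of regions, not with the number of pixels.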