Abstract: Rapid advances in remote sensing technology have made fine-resolution remote sensing images (RSIs), rich in spatial detail and semantics, widely available. Although transformers apply and scale well to semantic segmentation of RSIs by learning pairwise contextual affinity, they inevitably introduce irrelevant context, hindering accurate inference of patch semantics. To address this, we propose a novel multihead attention-attended module (AAM) that refines the multihead self-attention mechanism. By measuring the relevance between the self-attention maps and the query vector, the AAM generates an attention gate that filters out irrelevant context while simultaneously emphasizing useful contextual affinity with higher weights. Using the multihead AAM as the core unit, we construct a lightweight attention-attended transformer block (ATB), and on top of it we devise AAFormer, a pure transformer with a mask transformer decoder for semantic segmentation of RSIs. Extensive evaluations on the ISPRS Potsdam and LoveDA datasets demonstrate compelling performance compared with mainstream methods, and additional experiments analyze the effects of the AAM.
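The abstract does not give the AAM's exact formulation, so the following is only a minimal PyTorch sketch of one plausible reading: a sigmoid gate, derived from the interaction between the queries and keys, that rescales the standard softmax affinity so irrelevant context is suppressed and useful affinity is reweighted. The module and parameter names (`AttentionAttendedModule`, `gate_q`, and so on) are hypothetical and are not the authors' implementation.

```python
import torch
import torch.nn as nn


class AttentionAttendedModule(nn.Module):
    """Hypothetical sketch of a multihead attention-attended module (AAM).

    Assumption: the "attention gate" is a sigmoid-gated affinity computed
    from a learned query projection, multiplied into the softmax attention
    map and renormalized. The paper's actual formulation may differ.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Hypothetical gate projection relating each query to the keys.
        self.gate_q = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # Standard pairwise contextual affinity.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)

        # Assumed attention gate in [0, 1]: scores how relevant each
        # key is to each query, then rescales the attention row so that
        # irrelevant context is damped and useful affinity is emphasized.
        gate = torch.sigmoid((self.gate_q(q) @ k.transpose(-2, -1)) * self.scale)
        attn = attn * gate
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)


if __name__ == "__main__":
    # Toy usage: 2 images, 256 patch tokens, embedding dim 64.
    x = torch.randn(2, 256, 64)
    aam = AttentionAttendedModule(dim=64, num_heads=8)
    print(aam(x).shape)  # torch.Size([2, 256, 64])
```

In this reading, the gate complements rather than replaces the softmax affinity: the renormalization keeps each attention row a valid distribution after gating, which matches the abstract's claim that the AAM both filters irrelevant context and upweights informative context within the same mechanism.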