Abstract: Benefiting from the powerful feature extraction and feature-correlation modeling capabilities of convolutional neural networks (CNNs) and Transformer models, these architectures have been widely adopted for semantic segmentation of unmanned aerial vehicle (UAV) aerial images. However, ground objects in aerial images carry feature information at different scales, and existing methods directly cascade low-level visual features and high-level semantic features without further processing, which limits segmentation accuracy. To address these challenges, we propose a dual-encoder cross-scale attention network that efficiently extracts local and global context information from aerial images and performs fine-grained fusion of multiscale features to improve semantic segmentation performance. First, we introduce a dual CNN-Transformer encoder, which embeds the scan-focus window Transformer (SFWT) into the CNN as an auxiliary encoder to supplement the local feature information lost during global context extraction. Second, we design a cross-scale lightweight integration (CSLI) module, which uses a lightweight dot-product attention mechanism (DPAM) to fuse multiscale features while reducing the model's parameters and computation. Finally, a linear multilayer perceptron (LMLP) restores the feature-map resolution while enlarging the deconvolution receptive field. To validate the effectiveness of the proposed method, we conduct extensive experiments on real aerial scene datasets, including UAVid, Urban Drone, and AeroScapes. The experimental results show that our method achieves state-of-the-art performance while maintaining superior real-time efficiency. Implementation code will be available at https://github.com/darkseid-arch/UAVSeg.
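
For concreteness, the cross-scale attention fusion idea can be illustrated with a minimal sketch. This is not the paper's implementation: the module name, channel sizes, projection dimension, and single-head formulation are assumptions; it only shows how a lightweight dot-product attention can let fine-grained low-level features query coarse high-level semantics before fusion.

```python
import torch
import torch.nn as nn

class DotProductAttentionFusion(nn.Module):
    """Hypothetical sketch of a lightweight dot-product attention fusion
    of a low-level (high-resolution) feature map with a high-level
    (low-resolution) one, in the spirit of the CSLI/DPAM described above."""

    def __init__(self, low_ch: int, high_ch: int, dim: int = 64):
        super().__init__()
        # 1x1 projections keep the attention lightweight in parameters.
        self.to_q = nn.Conv2d(low_ch, dim, kernel_size=1)
        self.to_k = nn.Conv2d(high_ch, dim, kernel_size=1)
        self.to_v = nn.Conv2d(high_ch, dim, kernel_size=1)
        self.out = nn.Conv2d(dim, low_ch, kernel_size=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low:  (B, low_ch,  H, W)  fine-grained local features
        # high: (B, high_ch, h, w)  coarse semantic features
        b, _, H, W = low.shape
        q = self.to_q(low).flatten(2).transpose(1, 2)    # (B, HW, d)
        k = self.to_k(high).flatten(2)                   # (B, d, hw)
        v = self.to_v(high).flatten(2).transpose(1, 2)   # (B, hw, d)
        # Scaled dot-product attention over the coarse spatial positions.
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, hw)
        fused = (attn @ v).transpose(1, 2).reshape(b, -1, H, W)
        # Residual fusion: inject cross-scale context into the low-level map.
        return low + self.out(fused)

# Usage example with assumed channel sizes and resolutions.
fusion = DotProductAttentionFusion(low_ch=64, high_ch=256)
out = fusion(torch.randn(1, 64, 64, 64), torch.randn(1, 256, 16, 16))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```

Because attention is computed against the small high-level map (hw positions) rather than the full-resolution one, the memory and compute cost stays modest, which is consistent with the lightweight design goal stated in the abstract.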