Abstract: Accurate building extraction from remote sensing imagery is critical but challenged by scale variations, complex geometries, and indistinct boundaries. Existing CNNs struggle with global context, while Transformers often blur fine details, and hybrid methods may compromise edge precision or efficiency. To overcome these limitations, we design a Dual-Stream Feature Fusion Network (DSFNet). DSFNet features an Inception-inspired Spatial Information Extraction Stream (ISIES) that leverages orthogonal strip convolutions (e.g., 1×7/7×1, 1×9/9×1) to capture multi-scale local features with significantly enhanced edge delineation compared to standard convolutions. In addition, we introduce a Cross-dimensional Interaction and Multi-Scale Attention (CIMA) mechanism that refines features by explicitly modeling channel-spatial interactions with directional awareness, further improving boundary representation. These components are integrated with a Transformer stream for global context and an Adaptive Feature Fusion Module (AFFM) for effective hierarchical merging. Extensive experiments on the WHU and Inria benchmarks show state-of-the-art performance (e.g., 91.25% IoU, 95.47% F1 on WHU) with competitive efficiency (48.18M parameters, 112.02G FLOPs), validating the efficacy of strip convolutions and cross-dimensional attention for precise building extraction.
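To make the orthogonal strip-convolution idea concrete, the following is a minimal, hypothetical PyTorch sketch of an Inception-style block built from paired 1×k/k×1 convolutions. Only the kernel pairs (1×7/7×1, 1×9/9×1) come from the abstract; the module names (StripConvBranch, InceptionStripBlock), the 1×1 shortcut branch, and the fusion layer are illustrative assumptions, not the authors' ISIES implementation.

```python
# Hypothetical sketch of an Inception-style block with orthogonal strip convolutions.
# Only the 1x7/7x1 and 1x9/9x1 kernel pairs are taken from the abstract; the rest
# (branch structure, shortcut, fusion) is an assumed illustration, not DSFNet code.
import torch
import torch.nn as nn


class StripConvBranch(nn.Module):
    """One branch: a 1xk strip convolution followed by its orthogonal kx1 counterpart."""

    def __init__(self, in_ch: int, out_ch: int, k: int):
        super().__init__()
        self.conv_h = nn.Conv2d(in_ch, out_ch, kernel_size=(1, k), padding=(0, k // 2))
        self.conv_v = nn.Conv2d(out_ch, out_ch, kernel_size=(k, 1), padding=(k // 2, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv_v(self.conv_h(x)))


class InceptionStripBlock(nn.Module):
    """Parallel strip-convolution branches (k = 7, 9) plus a 1x1 shortcut,
    concatenated and projected back to the input channel width."""

    def __init__(self, channels: int, kernel_sizes=(7, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            [StripConvBranch(channels, channels, k) for k in kernel_sizes]
        )
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse = nn.Conv2d(channels * (len(kernel_sizes) + 1), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches] + [self.pointwise(x)]
        return self.fuse(torch.cat(feats, dim=1))


if __name__ == "__main__":
    block = InceptionStripBlock(channels=64)
    out = block(torch.randn(1, 64, 128, 128))
    print(out.shape)  # torch.Size([1, 64, 128, 128])
```

The intuition is that a 1×k kernel aggregates context along one axis cheaply, and pairing it with its k×1 counterpart recovers a large effective receptive field while keeping responses sharp along straight building edges, which is what the abstract attributes to ISIES.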