Abstract: Small-object detection in uncrewed aerial vehicle (UAV) and remote-sensing imagery remains challenging because targets occupy only a few pixels, features are sparse, objects are often occluded, and backgrounds are complex. To tackle these issues, we propose mamba vision you only look once (MV-YOLO), a lightweight, one-stage detector built from three complementary modules. First, the mamba vision module (MVM) uses a state-space model to capture global contextual dependencies across the feature map with linear complexity, while its variant, the mamba self-attention vision module (MSAVM), additionally incorporates multihead self-attention (MHSA) to strengthen long-range interactions. Second, the bio-inspired hierarchical feature modulation (BHFM) module decomposes large convolutional kernels into asymmetric and dilated convolutions, expanding the receptive field to $19\times 19$, and adds parallel spatial and channel attention branches to retain the high-frequency details that are crucial for small objects. Third, the context clue guided module (CCGM) learns sampling offsets to dynamically upsample low-resolution feature maps and fuses them with high-resolution features, sharpening small-object representations. In experiments on VisDrone2019, DIOR, and UCAS-AOD, MV-YOLO achieves mAP50 scores of 50.6%, 89.7%, and 97.6%, respectively, indicating that it is both efficient and effective for real-world aerial and remote-sensing applications.
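The receptive-field claim for BHFM can be checked with simple arithmetic. The abstract does not state the kernel sizes or dilation rates used, so the layer configuration below (`HYPOTHETICAL_LAYERS`) is purely an assumption chosen to reach $19\times 19$; it is not the authors' configuration. The sketch also assumes stride-1 depthwise convolutions, with each entry standing for a $1\times k$ convolution followed by a $k\times 1$ convolution.

```python
# Illustrative receptive-field arithmetic for a decomposed large kernel.
# ASSUMPTION: the (kernel, dilation) pairs are hypothetical; the abstract
# gives only the resulting 19x19 receptive field, not the actual layers.

def effective_kernel(kernel: int, dilation: int = 1) -> int:
    """Span covered along one axis by a dilated kernel."""
    return dilation * (kernel - 1) + 1


def stacked_receptive_field(layers):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    rf = 1
    for kernel, dilation in layers:
        rf += effective_kernel(kernel, dilation) - 1
    return rf


def depthwise_weights_per_channel(layers):
    """Weights per channel when each entry is a 1xk plus kx1 depthwise pair."""
    return sum(2 * kernel for kernel, _ in layers)


# Hypothetical decomposition: a 7-tap pair (dilation 1), then a 5-tap pair (dilation 3).
HYPOTHETICAL_LAYERS = [(7, 1), (5, 3)]

rf = stacked_receptive_field(HYPOTHETICAL_LAYERS)
decomposed = depthwise_weights_per_channel(HYPOTHETICAL_LAYERS)
dense = rf * rf  # weights per channel of a dense rf x rf depthwise kernel

print(f"receptive field: {rf}x{rf}")
print(f"weights per channel: decomposed={decomposed}, dense={dense}")
```

Under these assumed settings the stack spans 19 positions per axis with 24 depthwise weights per channel, against 361 for a dense $19\times 19$ depthwise kernel, which illustrates why such a decomposition suits a lightweight detector.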
External IDs: dblp:journals/tgrs/WuLGG25