Dual-branch cross-modal fusion with local-to-global learning for UAV object detection

Binyi Fang, Yixin Yang, Jingjing Chang, Ziyang Gao, Hai-Bao Chen

Published: 20 Jan 2025, Last Modified: 25 Nov 2025. License: CC BY-SA 4.0
Abstract: Because unmanned aerial vehicle (UAV) images differ significantly from natural scene images in lighting, scale, and viewing angle, existing multispectral detection techniques often fail to fully exploit the long-range dependencies between global and local information, resulting in poor performance in complex UAV scenarios. In this paper, we propose a novel dual-branch cross-modal fusion network that integrates a dual cross-attention transformer fusion block (CTF) for modeling global feature dependencies and an adaptive mask convolution fusion block (MCF) for extracting low-level local features, yielding a unified representation with both global and local receptive fields. Our local-to-global training strategy pairs a shallow global fusion network with a deep local fusion network, so the model operates on the entire image while also attending to fine local details. Additionally, we integrate an asymptotic feature pyramid network that employs adaptive spatial fusion to refine features, enhancing the accuracy of small object detection in UAV scenes. Evaluated on the DroneVehicle dataset for vehicle detection with infrared and visible light, our network outperforms existing methods, improving [email protected] by 7.01% over CAL-Net; compared with YOLOv8m-based single-modal detection, AP increases by 3.1% on infrared and 11.6% on visible images.
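To make the dual cross-attention fusion idea concrete, below is a minimal PyTorch sketch of a CTF-style block: each modality's tokens query the other modality's keys and values, and the two attended streams are merged into a single representation. The class name, tensor shapes, and concatenate-then-project merge are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of a dual cross-attention fusion (CTF-style) block.

    RGB tokens attend to the infrared stream and vice versa; the two
    attended streams are concatenated and projected back to the input width.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_ir = nn.LayerNorm(dim)
        # Cross-attention in both directions; batch_first expects [B, N, C].
        self.attn_rgb2ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_ir2rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # rgb, ir: flattened feature maps of shape [B, H*W, C].
        q_rgb, q_ir = self.norm_rgb(rgb), self.norm_ir(ir)
        # Each modality queries the other's keys/values.
        rgb_att, _ = self.attn_rgb2ir(q_rgb, q_ir, q_ir)
        ir_att, _ = self.attn_ir2rgb(q_ir, q_rgb, q_rgb)
        # Residual connections keep each modality's own features.
        fused = torch.cat([rgb + rgb_att, ir + ir_att], dim=-1)
        return self.proj(fused)  # [B, H*W, C] unified representation


# Example: fuse 32x32 feature maps with 256 channels from each modality.
if __name__ == "__main__":
    block = CrossAttentionFusion(dim=256)
    rgb = torch.randn(2, 32 * 32, 256)
    ir = torch.randn(2, 32 * 32, 256)
    print(block(rgb, ir).shape)  # torch.Size([2, 1024, 256])
```

In this sketch the residual-plus-projection merge is one plausible way to obtain the unified global/local representation the abstract describes; the paper's MCF branch would supply the complementary local features via masked convolutions.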