GLNet-YOLO: Multimodal Feature Fusion for Pedestrian Detection

Yi Zhang, Qing Zhao, Xurui Xie, Yang Shen, Jinhe Ran, Shu Gui, Haiyan Zhang, Xiuhe Li, Zhen Zhang

Published: 01 Sept 2025 · Last Modified: 05 Jan 2026 · License: CC BY-SA 4.0
Abstract: In modern computer vision, pedestrian detection is important for applications such as intelligent surveillance, autonomous driving, and robot navigation. However, single-modal images struggle to deliver high-precision detection in complex environments. To address this, this study proposes GLNet-YOLO, a framework based on cross-modal deep feature fusion that improves pedestrian detection in complex environments by fusing feature information from visible-light and infrared images. Extending the YOLOv11 architecture, the framework adopts a dual-branch network to process the visible-light and infrared inputs separately, and introduces an FM module for global feature fusion and enhancement and a DMR module for local feature separation and interaction. Experimental results on the LLVIP dataset show that, compared to single-modal YOLOv11 baselines, the fused model improves mAP@50 by 9.2% over the visible-light-only model and by 0.7% over the infrared-only model. This significantly improves detection accuracy under low-light and complex-background conditions and enhances the robustness of the algorithm; its effectiveness is further verified on the KAIST dataset.
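The abstract describes the architecture only at a high level. The following is a minimal PyTorch sketch of the dual-branch fusion idea it outlines: two modality-specific branches whose features interact locally and are then fused globally before a shared detection head. The internals of the FM and DMR modules are not given here, so the concrete choices below (concatenation plus a 1x1 convolution for global fusion, and a channel split with cross-modal exchange for local interaction) are illustrative assumptions, not the authors' design; all class and parameter names are hypothetical.

```python
# Minimal sketch of a dual-branch visible/infrared fusion stem.
# FM/DMR internals are assumptions for illustration, not the paper's method.
import torch
import torch.nn as nn


class FMBlock(nn.Module):
    """Hypothetical global fusion: concatenate both modalities, mix with a 1x1 conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([rgb, ir], dim=1))


class DMRBlock(nn.Module):
    """Hypothetical local separation/interaction: split channels, swap one half."""

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        rgb_a, rgb_b = rgb.chunk(2, dim=1)
        ir_a, ir_b = ir.chunk(2, dim=1)
        # Exchange half of the channels between modalities so each branch
        # carries some information from the other before global fusion.
        return torch.cat([rgb_a, ir_b], dim=1), torch.cat([ir_a, rgb_b], dim=1)


class DualBranchFusion(nn.Module):
    """Two modality-specific stems whose features are fused into one map."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # LLVIP stores infrared frames as 3-channel images, hence 3 input
        # channels for both stems; a 1-channel IR stem works the same way.
        self.rgb_stem = nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1)
        self.ir_stem = nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1)
        self.dmr = DMRBlock()
        self.fm = FMBlock(channels)

    def forward(self, rgb_img: torch.Tensor, ir_img: torch.Tensor) -> torch.Tensor:
        rgb, ir = self.rgb_stem(rgb_img), self.ir_stem(ir_img)
        rgb, ir = self.dmr(rgb, ir)  # local cross-modal interaction
        return self.fm(rgb, ir)      # global fusion into one feature map


if __name__ == "__main__":
    model = DualBranchFusion()
    fused = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
    print(fused.shape)  # torch.Size([1, 64, 128, 128])
```

In a full detector, the fused map would feed the YOLOv11 neck and head, and blocks like these would typically be repeated at several backbone stages rather than applied once at the stem.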