DMNet: A Dense Multiscale Feature Extraction Network With Two-Stage Training for Infrared-Visible Image Fusion

Published: 2025 · Last Modified: 04 Nov 2025 · IEEE Internet Things J. 2025 · CC BY-SA 4.0
Abstract: With the increasing need for intelligent and secure multimedia systems, infrared and visible image fusion (IVIF) has attracted considerable attention due to its ability to overcome the limitations of a single sensor and integrate complementary information from different modalities. However, many methods overlook how the spatial frequency information of visible and infrared images differs, and their inability to balance the extraction of global and local information leads to less thorough feature extraction. To address these difficulties, we propose DMNet, a dense multiscale fusion network. Through dual-stream collaborative feature decoupling, the proposed network jointly optimizes an encoder–decoder network and a diffusion model to extract multimodal information more comprehensively. Specifically, a three-stage progressive encoder sequentially integrates a dense transformer block (DTB) and a dense invertible neural network block (DIB) to achieve global feature extraction and multimodal feature decoupling. The proposed channel and spatial attention block (CSAB) selectively emphasizes important feature maps to better capture critical information. Additionally, the diffusion module (DM) extracts multiscale latent features to enhance the representation of cross-modal latent features. Extensive experiments demonstrate that DMNet outperforms representative state-of-the-art methods. Furthermore, comprehensive ablation studies validate the effectiveness of each module, and we show that DMNet improves downstream infrared-visible object detection performance. Our fused results and code will be accessible at https://github.com/Pancy9476/DMNet.
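To make the attention mechanism concrete, below is a minimal PyTorch sketch of what a channel and spatial attention block could look like, assuming a conventional CBAM-style design (channel attention from pooled statistics followed by a 7×7 spatial attention map). The class name `CSABSketch` and all hyperparameters are illustrative assumptions; the abstract does not specify DMNet's actual CSAB internals.

```python
import torch
import torch.nn as nn

class CSABSketch(nn.Module):
    """Hypothetical channel-and-spatial attention block (CBAM-style sketch).

    This is an illustrative guess at a CSAB; the exact DMNet design is
    not described in the abstract.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, then reweight feature maps.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: reweight locations using pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = x.mean(dim=(2, 3))                       # (B, C)
        mx = x.amax(dim=(2, 3))                        # (B, C)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)
        # Spatial attention from per-pixel channel statistics.
        avg_map = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)          # (B, 1, H, W)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * sa


# Usage: refine a fused feature map before decoding.
feats = torch.randn(2, 64, 128, 128)
refined = CSABSketch(64)(feats)
print(refined.shape)  # torch.Size([2, 64, 128, 128])
```

The sequential channel-then-spatial ordering here is one common convention; whether DMNet applies the two attentions sequentially or in parallel is not stated in the abstract.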