FefDM-Transformer: Dual-channel multi-stage Transformer-based encoding and fusion mode for infrared-visible images

Published: 01 Jan 2025, Last Modified: 31 Jul 2025, Expert Syst. Appl. 2025, CC BY-SA 4.0
Abstract: In existing CNN-based image fusion methods, feature extraction is generally achieved through convolutional operations, so the loss of global features during feature propagation is inevitable. Transformer-based methods can model an image's long-range dependencies well through self-attention, but in most Transformer-based fusion methods this modeling occurs at only a single stage, which leaves deep features inadequately extracted. To solve this problem, this paper proposes a new dual-channel multi-stage Transformer-based encoding and fusion mode for infrared–visible images, named FefDM-Transformer. During the encoding stage, a dual-channel spatial Transformer, a channel Transformer, and a dense cross-convolution feature extraction module are designed to fully extract important global and local features of the infrared and visible source images from the spatial dimension, the channel scale, and the convolution perspective, respectively; shallow-level feature fusion is then performed. During the feature fusion stage, a spatial Transformer and a channel Transformer are designed to further strengthen the global long-range dependencies of the features at the spatial and cross-channel scales and to achieve deep-level feature fusion. Finally, considering the specific characteristics of infrared and visible image representation, a single adaptive structural similarity loss function with gradient maximization is designed to improve fusion quality while avoiding the tuning burden and coupling problems of multiple loss functions. Experiments are conducted on one grayscale and two color image datasets. The results show that FefDM-Transformer achieves the best subjective visual quality and objective performance among the compared fusion methods. It also performs well in detection and semantic segmentation, effectively facilitating downstream high-level vision tasks.
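The abstract names channel Transformers that model cross-channel dependencies in both the encoding and fusion stages. The paper's code is not reproduced here; the following is a minimal PyTorch sketch of one common way to realize channel-dimension self-attention (attention computed over a C x C map rather than over spatial positions). The class name and all design details are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Illustrative channel-wise self-attention: queries, keys, and values
    # are compared across channels, so the attention map is C x C and its
    # cost does not grow with image resolution.
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.scale = nn.Parameter(torch.ones(1))  # learnable temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q = F.normalize(q.flatten(2), dim=-1)        # (B, C, HW)
        k = F.normalize(k.flatten(2), dim=-1)
        attn = (q @ k.transpose(1, 2)) * self.scale  # (B, C, C)
        out = attn.softmax(dim=-1) @ v.flatten(2)    # (B, C, HW)
        return self.proj(out.view(b, c, h, w)) + x   # residual connection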
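The single adaptive structural similarity loss with gradient maximization is likewise only named in the abstract. A plausible reading, sketched below in PyTorch, is that the fused image's gradient is pulled toward the element-wise maximum of the two source gradients, while SSIM terms against each source are weighted adaptively; the Sobel operator, the gradient-energy weighting, and the use of the third-party pytorch_msssim package are all assumptions made for illustration.

import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation

def sobel_gradient(img: torch.Tensor) -> torch.Tensor:
    # Gradient magnitude of a single-channel batch (B, 1, H, W) via Sobel kernels.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def fusion_loss(fused, ir, vis):
    # Gradient-maximization term: the fused gradient should match the
    # element-wise maximum of the infrared and visible gradients.
    target = torch.maximum(sobel_gradient(ir), sobel_gradient(vis))
    grad_term = F.l1_loss(sobel_gradient(fused), target)
    # Adaptive SSIM term: per-image weights proportional to each source's
    # gradient energy (an assumed weighting scheme, not the paper's).
    g_ir = sobel_gradient(ir).mean(dim=(1, 2, 3))
    g_vis = sobel_gradient(vis).mean(dim=(1, 2, 3))
    w_ir = g_ir / (g_ir + g_vis + 1e-8)
    s_ir = ssim(fused, ir, data_range=1.0, size_average=False)   # (B,)
    s_vis = ssim(fused, vis, data_range=1.0, size_average=False)
    ssim_term = (w_ir * (1 - s_ir) + (1 - w_ir) * (1 - s_vis)).mean()
    return ssim_term + grad_term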