MAFFuse: Multi-Attention Fusion Network for Efficient and Robust Image Fusion

NLDL 2026 Conference Submission 63

14 Sept 2025 (modified: 05 Nov 2025) · Submitted to NLDL 2026 · CC BY 4.0
Keywords: image fusion, attention mechanism, vision transformer
Abstract: Image fusion seeks to combine source images into a single, more informative image while retaining the complementary information from the original images. Existing image fusion models often achieve good results at the cost of increased complexity and computational expense, much of which arises from processing redundant information inherent in strongly correlated images from different sources. In this paper, we introduce an end-to-end lightweight encoder-decoder network that uses channel and spatial attention mechanisms to focus on the most relevant features from multi-source inputs and depthwise convolutions for efficient feature fusion. Our fusion block integrates convolutional layers with a Swin Transformer to capture both local details and global context. Comprehensive evaluations on various benchmarks demonstrate that our approach consistently rivals state-of-the-art methods while maintaining lower computational complexity. Furthermore, we evaluate the fused images on downstream tasks, including semantic segmentation on the MSRS dataset and object detection, showing that our approach enhances task-specific performance. Ablation studies further validate the effectiveness of our specific model design, such as the multi-attention integration, in achieving robust performance with reduced complexity.
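The abstract describes channel and spatial attention for feature selection followed by a depthwise convolution for efficient fusion. The paper's exact architecture is not given here, so the following is a minimal PyTorch sketch of that general pattern (CBAM-style attention plus a depthwise-separable fusion layer); all module names, kernel sizes, and the reduction ratio are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch of attention-guided fusion with a depthwise-separable
# convolution. Module names and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Global average pooling -> per-channel gate in (0, 1)
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))))
        return x * w[:, :, None, None]


class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel-wise mean and max maps -> one spatial gate per location
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], 1)
        return x * torch.sigmoid(self.conv(s))


class FusionBlock(nn.Module):
    """Fuse features from two sources via attention + depthwise-separable conv."""

    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
        self.fuse = nn.Sequential(
            # depthwise conv over the concatenated features (groups = channels)
            nn.Conv2d(2 * channels, 2 * channels, 3, padding=1, groups=2 * channels),
            # pointwise conv mixes channels and projects back to `channels`
            nn.Conv2d(2 * channels, channels, 1),
        )

    def forward(self, a, b):
        a = self.sa(self.ca(a))
        b = self.sa(self.ca(b))
        return self.fuse(torch.cat([a, b], dim=1))


# Two same-shaped feature maps (e.g. infrared and visible branches)
fused = FusionBlock(16)(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))
print(tuple(fused.shape))  # (1, 16, 32, 32)
```

The depthwise-then-pointwise split is what keeps the fusion cheap: the 3x3 convolution touches each channel independently, and only the 1x1 convolution mixes channels, which is in line with the paper's stated emphasis on efficiency. The abstract's Swin Transformer component for global context is omitted here.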
Serve As Reviewer: ~Abhinav_Sagar1
Submission Number: 63