Abstract: This paper proposes an effective attention-scale sparse cascade network, termed ASCNet, for multimodal infrared and visible image fusion, which aims to highlight salient objects, extract multi-dimensional features, and improve computational efficiency. ASCNet consists of an encoder, a fusion strategy, and a decoder, and is capable of generating a potential fused image with good performance. In the encoder, a squeeze-and-excitation (SE) module is combined with the first-level feature maps, which are obtained by convolving the source images in the previous layer, to produce second-level feature maps; these are convolved again to obtain cascade maps, and the two kinds of maps are then connected in a novel sparse cascade manner to form an SE cascade block (SECB). After three identical SECBs in series, significant texture information from the input images is extracted and well preserved. A three-scale mechanism is then embedded at the end of the SECBs to expand the receptive field for extracting deep-level features: three convolution kernels of different sizes extract features at different scales, and the final output maps are obtained in the same sparse cascade manner. Comparing the potential images with the input images, we observe that the potential images lose detail features even though they perform well in terms of global contrast. We therefore propose a multi-component optimization block based on three feature components (contrast, visible texture, and infrared luminance) to achieve better fusion enhancement. Experiments on public datasets demonstrate that our method outperforms several state-of-the-art methods in terms of both subjective and objective indicators.
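The abstract builds on the standard squeeze-and-excitation (SE) mechanism for channel-wise feature recalibration. As an orientation only (the paper's exact SECB wiring is not shown here), a minimal NumPy sketch of generic SE gating, with hypothetical toy shapes and weights:

```python
import numpy as np

def squeeze_excitation(feature_maps, w1, w2):
    """Generic channel-wise squeeze-and-excitation gating.

    feature_maps: array of shape (C, H, W)
    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights
    """
    # Squeeze: global average pooling over spatial dims -> (C,)
    z = feature_maps.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid gate in (0, 1)
    s = np.maximum(w1 @ z, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Re-scale each channel map by its learned gate
    return feature_maps * gate[:, None, None]

# Hypothetical usage: C=4 channels, reduction ratio r=2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1 = rng.standard_normal((2, 4))
w2 = rng.standard_normal((4, 2))
y = squeeze_excitation(x, w1, w2)
```

Since the sigmoid gate lies in (0, 1), each output channel is an attenuated copy of the input channel; in ASCNet such gated maps would then be concatenated with the convolutional cascade maps in the sparse cascade manner described above.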
External IDs: dblp:journals/mms/ZhangKL25