Keywords: Multimodal Image Fusion, Spatio-Temporal Imbalance, Diffusion-Based Dynamic Image Fusion
Abstract: Image fusion integrates complementary information from multiple sources to generate more informative results. Recently, the diffusion model, which demonstrates unprecedented generative potential, has been explored for image fusion. During diffusion-based generation, information emerges at unequal rates across image regions and denoising steps, so the fusion should dynamically weight the source modalities. To motivate this, we reveal a significant spatio-temporal imbalance in image denoising: the diffusion model produces dynamic information gains in different image regions as denoising proceeds. Based on this observation, we analyze the Diffusion Information Gains (DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably tightens the upper bound of the generalization error. Accordingly, we use the diffusion information gains to quantify the information contribution of each modality at each denoising step, thereby providing dynamic guidance during the fusion process. Experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in both fusion quality and inference efficiency.
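To make the mechanism concrete, below is a minimal, hypothetical sketch of the idea the abstract describes: a per-region information gain between consecutive denoising steps is used to weight two source modalities dynamically. All names (`dig_weights`, `fuse_step`) and the gain definition (absolute change of the denoised estimate between steps) are illustrative assumptions, not the paper's actual DIG formulation.

```python
# Hypothetical sketch: dynamic, per-pixel fusion weights from the change of
# each modality's denoised estimate across consecutive denoising steps.
import numpy as np

def dig_weights(x0_prev: np.ndarray, x0_curr: np.ndarray,
                y0_prev: np.ndarray, y0_curr: np.ndarray,
                tau: float = 1.0) -> np.ndarray:
    """Per-pixel fusion weight for modality x from its relative info gain."""
    gain_x = np.abs(x0_curr - x0_prev)   # how much modality x refined this step
    gain_y = np.abs(y0_curr - y0_prev)   # how much modality y refined this step
    # Softmax over the two gains yields a weight in (0, 1) for modality x.
    ex, ey = np.exp(gain_x / tau), np.exp(gain_y / tau)
    return ex / (ex + ey)

def fuse_step(x0_prev, x0_curr, y0_prev, y0_curr):
    """Dynamically weighted fusion of the two current denoised estimates."""
    w = dig_weights(x0_prev, x0_curr, y0_prev, y0_curr)
    return w * x0_curr + (1.0 - w) * y0_curr

# Toy usage: two 4x4 "denoised estimates" at consecutive steps.
rng = np.random.default_rng(0)
x_prev, x_curr = rng.random((4, 4)), rng.random((4, 4))
y_prev, y_curr = rng.random((4, 4)), rng.random((4, 4))
fused = fuse_step(x_prev, x_curr, y_prev, y_curr)
print(fused.shape)  # (4, 4)
```

The softmax temperature `tau` here is an assumed knob controlling how sharply the fusion favors the modality with the larger gain; the paper's framework derives its weighting from the generalization-error bound rather than this heuristic.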
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 688