Abstract: Existing multi-modal image fusion algorithms are typically designed for high-quality inputs and fail to handle degradations (e.g., low light, low resolution, and noise), which prevents image fusion from realizing its full potential in practice. In this work, we present Degradation-Robust Multi-modality image Fusion (DRMF), which leverages the powerful generative properties of diffusion models to counteract various degradations during image fusion. Our key insight is that generative diffusion models driven by different modalities and degradations are inherently complementary during the denoising process. Specifically, we pre-train multiple degradation-robust conditional diffusion models for different modalities to handle degradations. Subsequently, a diffusion prior combination module is devised to integrate the generative priors of the pre-trained uni-modal models, enabling effective multi-modal image fusion. Extensive experiments demonstrate that DRMF excels in infrared-visible and medical image fusion, even under complex degradations.
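The following is a minimal, hypothetical sketch of the core idea described above: two pre-trained uni-modal conditional denoisers drive a single shared reverse diffusion chain, with their noise predictions merged at every step. The fixed weight `w` is a stand-in for DRMF's learned diffusion prior combination module, and the model and argument names (`eps_model_ir`, `cond_ir`, etc.) are illustrative assumptions rather than the paper's actual interface.

```python
import torch

# Hypothetical sketch: fuse two modalities by running one shared DDPM
# reverse chain whose noise estimate combines two pre-trained uni-modal
# conditional diffusion priors. The weighted sum below is a naive stand-in
# for DRMF's learned diffusion prior combination module.

@torch.no_grad()
def fuse_by_denoising(eps_model_ir, eps_model_vis, cond_ir, cond_vis,
                      betas, shape, w=0.5, device="cpu"):
    """Run a shared denoising process driven by two uni-modal priors."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # start the chain from pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

        # Each pre-trained model predicts noise conditioned on its modality.
        eps_ir = eps_model_ir(x, t_batch, cond_ir)
        eps_vis = eps_model_vis(x, t_batch, cond_vis)

        # Combine the two generative priors (placeholder for the learned module).
        eps = w * eps_ir + (1.0 - w) * eps_vis

        # Standard DDPM posterior mean update using the combined prediction.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])

        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # inject noise except at t = 0
    return x  # fused image estimate
```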
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work studies robust multimodal fusion in the context of multimedia content understanding. Specifically, we propose a novel diffusion model-based framework for multimodal image fusion, which harnesses the powerful generative properties of diffusion models to mitigate various complex degradations in source images. Furthermore, a diffusion prior combination module is devised to aggregate generative diffusion priors from different modalities, effectively leveraging the complementary nature of diffusion models driven by various modalities and degradations. Extensive experiments on infrared-visible image fusion and medical image fusion demonstrate the superiority of the proposed method, especially when source images suffer from composite degradations. Additionally, experiments on downstream high-level vision tasks, such as object detection, show that this work effectively boosts multimedia content understanding via complementary information enhancement and aggregation.
Supplementary Material: zip
Submission Number: 2367