Probing Synergistic High-Order Interaction for Multi-Modal Image Fusion

Published: 01 Jan 2025, Last Modified: 10 Jul 2025, IEEE Trans. Pattern Anal. Mach. Intell. 2025, CC BY-SA 4.0
Abstract: Multi-modal image fusion aims to generate a fused image by integrating and distinguishing the cross-modality complementary information from multiple source images. While the cross-attention mechanism with global spatial interactions appears promising, it only captures second-order spatial interactions, neglecting higher-order interactions in both spatial and channel dimensions. This limitation hampers the exploitation of synergies between modalities. To bridge this gap, we introduce a Synergistic High-order Interaction Paradigm (SHIP), designed to systematically investigate fine-grained spatial and global-statistics collaborations between the multi-modal images across two fundamental dimensions: 1) Spatial dimension: we construct fine-grained spatial interactions through element-wise multiplication, mathematically equivalent to global interactions, and then build high-order forms by iteratively aggregating and evolving complementary information, enhancing both efficiency and flexibility. 2) Channel dimension: expanding on channel interactions with first-order statistics (mean), we devise high-order channel interactions to facilitate the discernment of inter-dependencies between source images based on global statistics. We further introduce an enhanced version of the SHIP model, called SHIP++, which strengthens the cross-modality information interaction representation through a cross-order attention evolving mechanism, cross-order information integration, and a residual information memorizing mechanism. Harnessing high-order interactions significantly enhances our model’s ability to exploit multi-modal synergies, leading to superior performance over state-of-the-art alternatives, as shown through comprehensive experiments across various benchmarks in two significant multi-modal image fusion tasks: pan-sharpening and infrared and visible image fusion.
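
The two interaction types named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names (`high_order_spatial`, `channel_stats`, `channel_gate`), the residual update rule, and the choice of standard deviation with a sigmoid gate as the "higher-order statistic" are all assumptions made for illustration only. Feature maps are represented as flat lists of floats, one list per channel.

```python
from math import exp
from statistics import fmean, pstdev


def high_order_spatial(a, b, order=3):
    """Iterated element-wise interaction between two same-sized feature maps.

    Hypothetical update rule (an assumption, not taken from the paper):
    state <- state * b + a, applied `order` times. After k steps every
    element contains products of up to degree k in b.
    """
    if len(a) != len(b):
        raise ValueError("feature maps must have the same length")
    state = list(a)
    for _ in range(order):
        # Element-wise multiplication plus a residual copy of the input.
        state = [s * y + x for s, y, x in zip(state, b, a)]
    return state


def channel_stats(channels):
    """Per-channel global statistics: (mean, population std)."""
    return [(fmean(c), pstdev(c)) for c in channels]


def channel_gate(src_channels, ref_channels):
    """Reweight each source channel using statistics of the reference.

    The sigmoid gate over mean + std is an illustrative assumption.
    """
    gated = []
    for src, (mu, sigma) in zip(src_channels, channel_stats(ref_channels)):
        weight = 1.0 / (1.0 + exp(-(mu + sigma)))
        gated.append([weight * v for v in src])
    return gated


if __name__ == "__main__":
    print(high_order_spatial([1.0, 2.0], [0.5, 2.0], order=2))
    print(channel_gate([[1.0, 1.0]], [[0.0, 0.0]]))
```

In this sketch the spatial interaction costs one pass over the elements per order, which is the kind of efficiency the abstract attributes to element-wise multiplication; how the paper actually aggregates orders or chooses channel statistics is not specified in the abstract.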