TeRF: Text-driven and Region-aware Flexible Visible and Infrared Image Fusion

Hebaixu Wang; Hao Zhang; Xunpeng Yi; Xinyu Xiang; Leyuan Fang; Jiayi Ma

Published: 20 Jul 2024, Last Modified: 21 Jul 2024. Venue: MM2024 Poster. License: CC BY 4.0

Abstract: The fusion of visible and infrared images aims to produce high-quality fused images with rich textures and salient target information. Existing methods lack interactivity and flexibility in how fusion is performed. Users cannot express requirements to modify the fusion effect, and a single fusion model treats all regions of the source images identically, which leads to homogenized fusion results with low distinction between regions. In addition, their pre-defined fusion strategies invariably produce monotonous effects and are not comprehensive enough: they fail to adequately account for the data credibility, scene illumination, and noise degradation inherent in the source information. To address these issues, we propose Text-driven and Region-aware Flexible visible and infrared image fusion, termed TeRF. On the one hand, we propose a flexible image fusion framework built on multiple large language and vision models, which facilitates visual-text interaction. On the other hand, we aggregate comprehensive fine-tuning paradigms for the different fusion requirements into a unified fine-tuning pipeline. This allows regions and effects to be selected through language, yielding visually appealing fusion outcomes. Extensive experiments demonstrate that our method is competitive with existing state-of-the-art methods both qualitatively and quantitatively.

Primary Subject Area: [Content] Multimodal Fusion

Secondary Subject Area: [Content] Vision and Language

Relevance To Conference: In this work, we propose a text-driven and region-aware framework for visible and infrared image fusion with high interactivity and flexibility. It combines large language and vision models so that the fusion effects of different regions can be modified through language. We also devise a high-performance fusion backbone to obtain superior fusion precursors. Furthermore, we construct a unified fine-tuning pipeline for flexible fusion modification, which accounts for a comprehensive set of fusion strategies.

Supplementary Material: zip

Submission Number: 2025
