IVTFuse: An Efficient Vision-Language Guided Infrared-Visible Image Fusion Network with Frequency-Strip and Hybrid Pooling Attention Modules

Published: 08 Oct 2025, Last Modified: 16 Oct 2025 · Agents4Science · CC BY 4.0
Keywords: Image fusion, Vision-language model
Abstract: Infrared-visible image fusion (IVF) aims to combine complementary thermal and visible information into a single image that is informative for both human observation and computer vision tasks. However, existing fusion methods often struggle to preserve both the fine details and the semantic context of a scene, especially when relying solely on image-based features. We propose $\textbf{IVTFuse}$, a vision-language guided fusion network that addresses these challenges by incorporating textual semantic guidance and frequency-aware attention mechanisms. IVTFuse introduces two lightweight modules, $\textit{Frequency Strip Attention}$ (FSA) and $\textit{Hybrid Pooling Attention}$ (HPA), within each modality-specific encoder to adaptively enhance salient structures and regions. In parallel, a text description of the scene is encoded by a pre-trained BLIP model and injected into the fusion process through cross-attention, providing high-level context to guide feature merging. The architecture is built on efficient Restormer-based transformers and maintains a compact model size, making it feasible for real-time applications. Extensive experiments on three public IVF benchmarks show that IVTFuse outperforms 10 state-of-the-art methods, producing fused images with improved detail and semantic fidelity.
Supplementary Material: zip
Submission Number: 23
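
To make the vision-language guidance described in the abstract concrete, below is a minimal PyTorch sketch of how a BLIP-derived text embedding could be injected into the fusion of infrared and visible features through cross-attention. The module name `TextGuidedFusion`, the dimensions, and the single-block structure are illustrative assumptions, not the authors' implementation; the FSA, HPA, and Restormer components of IVTFuse are not reproduced here.

```python
# Hypothetical sketch: text-conditioned cross-attention fusion of IR/VIS features.
import torch
import torch.nn as nn


class TextGuidedFusion(nn.Module):
    """Fuse infrared and visible features, using text tokens as cross-attention keys/values."""

    def __init__(self, dim: int = 64, text_dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)             # map BLIP text tokens to the image feature dim
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Conv2d(2 * dim, dim, kernel_size=1)   # simple channel-wise merge of IR/VIS features

    def forward(self, ir_feat, vis_feat, text_tokens):
        # ir_feat, vis_feat: (B, C, H, W); text_tokens: (B, L, text_dim)
        fused = self.merge(torch.cat([ir_feat, vis_feat], dim=1))
        b, c, h, w = fused.shape
        query = fused.flatten(2).transpose(1, 2)              # (B, H*W, C) image queries
        kv = self.text_proj(text_tokens)                      # (B, L, C) text keys/values
        attended, _ = self.cross_attn(query, kv, kv)          # inject high-level semantic context
        return fused + attended.transpose(1, 2).reshape(b, c, h, w)  # residual connection


if __name__ == "__main__":
    module = TextGuidedFusion()
    ir = torch.randn(1, 64, 32, 32)
    vis = torch.randn(1, 64, 32, 32)
    text = torch.randn(1, 20, 256)                            # stand-in for BLIP text-encoder output
    print(module(ir, vis, text).shape)                        # torch.Size([1, 64, 32, 32])
```

The key design point this sketch illustrates is that the image features act as queries while the text tokens supply keys and values, so the semantic description modulates where and how the two modalities are merged rather than being fused pixel-wise.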