ControlFuse: Instruction-guided Multi-Granularity Controllable Image Fusion

Published: 29 Jan 2026, Last Modified: 06 Feb 2026 · AAAI 2026 (Oral) · CC BY 4.0
Abstract: Infrared and visible image fusion (IVIF) integrates complementary visual information to produce enhanced representations. However, most existing IVIF methods generate fixed outputs, lacking the flexibility to adapt to user-specified requirements. Recent text-guided approaches offer partial controllability but remain limited to global- or semantic-level fusion and cannot achieve instance-level control. This limitation primarily arises from two challenges: the absence of datasets linking textual instructions with corresponding spatial annotations, and the use of coarse cross-modal alignment methods incapable of accurately matching textual inputs with visual features. To overcome these challenges, we propose ControlFuse, a controllable IVIF framework enabling multi-granularity fusion guided directly by user instructions. First, we construct an automated multi-granularity dataset that provides explicit text-mask correspondences at the global, semantic, and instance levels. Second, inspired by manifold geometry, we design a multimodal feature interaction module consisting of a Feature Manifold Converter (FMC) and Curvature-Guided Interaction (CGI). FMC projects textual and visual features into a unified manifold space, while CGI leverages manifold curvature as a geometric cue to refine cross-modal alignment. Extensive experiments demonstrate that ControlFuse achieves precise and flexible controllability across different fusion granularities, benefiting high-level tasks.
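The abstract only sketches the FMC/CGI design at a high level, so the following is purely an illustrative toy, not the paper's actual method: it shows the general pattern of projecting text and visual features into a shared space, normalizing onto a simple manifold (the unit hypersphere), and scoring cross-modal alignment by geodesic distance. All dimensions, weight matrices, and function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64-d text features, 128-d visual features,
# projected into a shared 32-d space (all values are illustrative).
d_text, d_vis, d_shared = 64, 128, 32
W_text = rng.normal(size=(d_text, d_shared)) / np.sqrt(d_text)
W_vis = rng.normal(size=(d_vis, d_shared)) / np.sqrt(d_vis)

def to_manifold(x, W):
    """Project features into the shared space and normalize onto the
    unit hypersphere (a simple stand-in for a unified manifold space)."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def geodesic_alignment(t, v):
    """Score cross-modal alignment by geodesic (angular) distance on
    the sphere; a smaller angle means a better text-visual match."""
    cos = np.clip(t @ v.T, -1.0, 1.0)
    return np.arccos(cos)  # shape: (n_text, n_visual)

text_feats = rng.normal(size=(3, d_text))  # e.g. 3 instruction embeddings
vis_feats = rng.normal(size=(5, d_vis))    # e.g. 5 region features

t = to_manifold(text_feats, W_text)
v = to_manifold(vis_feats, W_vis)
angles = geodesic_alignment(t, v)
match = angles.argmin(axis=1)  # best-matching region per instruction
print(angles.shape, match.shape)
```

In the paper's framework the projections would be learned and the alignment additionally refined by a curvature cue (CGI), which this fixed-weight, constant-curvature sketch does not attempt to reproduce.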