FuseAgent: A VLM-driven Agent for Unified In-the-Wild Image Fusion

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Image Fusion, Low-Level Vision, Multimodal Large Language Model, Multimodal Agent
Abstract: Fusing multi-source images captured in the wild is often undermined by unpredictable, coupled degradations, including pixel-level misalignment, adverse weather, and dynamic artifacts. Existing solutions face notable limitations: (1) Task-specific models rely on predefined degradation priors and fail to generalize to the complex, coupled degradations present in real-world scenarios. (2) All-in-one methods, while designed for multiple fusion tasks, frequently overlook the degradations inherent in the input images, leading to suboptimal performance. To address these challenges, we introduce FuseAgent, a VLM-powered agent system that autonomously identifies degradations in the input images and dynamically coordinates expert models to execute a tailored fusion strategy. FuseAgent undergoes a two-stage training process: an initial supervised fine-tuning (SFT) stage establishes basic degradation perception and tool-use skills, followed by Group Relative Policy Optimization for fusion (GRPO-F), augmented with multi-dimensional rewards, to further enhance its decision-making and tool proficiency. Experimental results demonstrate the superior performance of FuseAgent in handling complex and coupled degradations in real-world scenes, achieving a 20% average improvement across all evaluation metrics on challenging in-the-wild benchmarks.
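The abstract does not spell out GRPO-F's reward machinery; as background, GRPO-style methods score a group of sampled rollouts and normalize each reward against the group's statistics. The sketch below illustrates only that generic group-relative advantage computation, with hypothetical reward values; the reward dimensions and weights used by FuseAgent are assumptions, not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style normalization: advantage_i = (r_i - mean(r)) / std(r),
    computed within one group of sampled rollouts."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a uniform group
    return [(r - mu) / sigma for r in rewards]

def combined_reward(scores, weights):
    """Hypothetical multi-dimensional reward: a weighted sum of per-dimension
    scores (e.g. fusion quality, degradation handling, tool-call validity)."""
    return sum(w * s for w, s in zip(weights, scores))

# Example: 4 rollouts, each scored on 3 hypothetical reward dimensions
per_dim_scores = [(0.2, 0.5, 1.0), (0.6, 0.4, 1.0), (0.9, 0.8, 0.0), (0.5, 0.5, 1.0)]
weights = (0.5, 0.3, 0.2)
rewards = [combined_reward(s, weights) for s in per_dim_scores]
advantages = group_relative_advantages(rewards)
```

Rollouts scoring above the group mean receive positive advantages and are reinforced; the normalization removes the need for a separate learned value baseline, which is the usual motivation for group-relative schemes.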
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16187