Keywords: Image Fusion, Low Level, Multimodal Large Language Model, Multimodal Agent
Abstract: Fusing multi-source images captured in the wild is often undermined by unpredictable and coupled degradations, including pixel-level misalignment, adverse weather, and dynamic artifacts. Existing solutions face notable limitations: (1) task-specific models rely on predefined degradation priors and fail to generalize to the complex, coupled degradations present in real-world scenarios; (2) all-in-one methods, while designed for multiple fusion tasks, frequently overlook the degradations inherent in the input images, leading to suboptimal performance. To address these challenges, we introduce FuseAgent, a VLM-powered agent system that autonomously identifies degradations in the input images and dynamically coordinates expert models to execute a tailored fusion strategy. FuseAgent undergoes a two-stage training process: an initial supervised fine-tuning (SFT) stage establishes basic degradation perception and tool-use skills, followed by Group Relative Policy Optimization for fusion (GRPO-F), augmented with multi-dimensional rewards to further enhance decision-making and tool proficiency. Experimental results demonstrate the superior performance of FuseAgent in handling complex, coupled degradations in real-world scenes, achieving a **20\%** average improvement across all evaluation metrics on challenging in-the-wild benchmarks.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16187