Learning A Low-Level Vision Generalist via Visual Task Prompt

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Building a unified model for general low-level vision tasks has important research and practical value. However, existing methods still face challenges when dealing with diverse low-level vision problems. Multi-task restoration approaches can simultaneously address various degradation-to-clean restoration tasks, but their applicability to tasks with different target domains (e.g., image stylization) remains limited. Methods such as PromptGIP, which can handle tasks with multiple input-target domains, mainly rely on the Masked Autoencoder (MAE) training paradigm. Unfortunately, these approaches are tightly coupled to the ViT architecture, resulting in suboptimal image reconstruction quality. In addition, they tend to be sensitive to prompt content and often fail on tasks that involve low-frequency information processing, such as color and style. In this paper, we present a Visual task Prompt-based Image Processing (VPIP) framework to address these challenges. The framework employs visual task prompts to process tasks with different input-target domains and allows flexible selection of a backbone network suitable for various low-level vision tasks. A prompt cross-attention mechanism is introduced to handle the information interaction between the input and the prompt. Based on the VPIP framework, we train a low-level vision generalist model, GenLV, on 30 diverse tasks. Experimental results show that GenLV successfully addresses a variety of low-level tasks and significantly outperforms existing methods both quantitatively and qualitatively.
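To illustrate the prompt cross-attention idea described in the abstract, the PyTorch sketch below shows one way backbone features could attend to visual task-prompt features. This is a minimal sketch under stated assumptions, not the paper's implementation: the class name `PromptCrossAttention`, the token dimensions, the head count, and the residual/normalization layout are all illustrative choices.

```python
# Minimal sketch of prompt cross-attention (illustrative only; names,
# dimensions, and layout are assumptions, not the authors' code).
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Queries come from the input-image features; keys and values
        # come from the visual task-prompt features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # x:      (B, N_x, dim) flattened input-feature tokens
        # prompt: (B, N_p, dim) flattened prompt-feature tokens
        out, _ = self.attn(query=x, key=prompt, value=prompt)
        return self.norm(x + out)  # residual connection

# Usage: inject task information from a prompt pair into backbone features.
x = torch.randn(1, 64 * 64, 256)
prompt = torch.randn(1, 2 * 64 * 64, 256)  # e.g., input/target prompt tokens
y = PromptCrossAttention()(x, prompt)
print(y.shape)  # torch.Size([1, 4096, 256])
```

Because the task information enters through cross-attention rather than through a fixed input layout, a sketch like this is agnostic to the backbone, which is consistent with the framework's stated flexibility in backbone choice.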
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This work applies visual task prompts to general low-level vision tasks in a new way. In most previous works, prompts are used for controllable image generation or editing in the form of the language modality. In this work, we show that visual task prompts are also a viable solution for general low-level vision tasks. We believe this can inspire further work on using mixed modalities for controllable image restoration, generation, and editing models.
Supplementary Material: zip
Submission Number: 5085