Abstract: Universal few-shot dense prediction requires a versatile model capable of learning any dense prediction task from limited labeled images, which demands efficient adaptation abilities. Prevailing few-shot learning methods rely on efficient fine-tuning of model weights for few-shot adaptation, which risks disrupting the pre-trained knowledge and lacks the capability to extract task-specific knowledge contained in the pre-trained model. To overcome these limitations, our paper approaches universal few-shot dense prediction from a novel perspective. Unlike conventional fine-tuning techniques that directly use all parameters of the model and modify a specific set of weights for few-shot adaptation, our method focuses on selecting the task-relevant computation pathways of the pre-trained model while keeping the model weights frozen. Building upon this idea, we introduce UniDense, a novel framework for universal few-shot dense prediction. First, we construct a versatile MoE architecture for dense prediction based on the Stable Diffusion model. We then use episode-based meta-learning to train a set of routers for this MoE model, called Meta-Routers, which act as hyper-networks responsible for selecting the computation blocks relevant to each task. We demonstrate that fine-tuning these meta-routers for novel tasks enables efficient adaptation of the entire model. Moreover, for each few-shot task, we leverage the support samples to extract a task embedding, which serves as a conditioning factor for the meta-routers. This strategy allows the meta-routers to dynamically adapt themselves to different few-shot tasks, leading to improved adaptation performance. Experiments on a challenging variant of the Taskonomy dataset with 10 dense prediction tasks demonstrate the superiority of our approach.
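To make the routing idea concrete, the sketch below shows a minimal PyTorch toy version of a task-conditioned router over frozen expert blocks: a small trainable "meta-router" maps a task embedding pooled from support samples to mixture weights over frozen experts, so only the router is fine-tuned per task. All module names, sizes, and the average-pooled task embedding are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the paper's implementation): a trainable
# meta-router produces routing weights over frozen expert blocks, conditioned
# on a task embedding pooled from support-sample features.
import torch
import torch.nn as nn


class MetaRouter(nn.Module):
    """Hyper-network mapping a task embedding to routing weights."""

    def __init__(self, task_dim: int, num_experts: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(task_dim, task_dim), nn.GELU(),
            nn.Linear(task_dim, num_experts),
        )

    def forward(self, task_emb: torch.Tensor) -> torch.Tensor:
        # Soft selection over computation blocks; could be sparsified (top-k).
        return torch.softmax(self.mlp(task_emb), dim=-1)


class MoEBlock(nn.Module):
    """Frozen expert blocks combined by task-conditioned routing weights."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )
        for p in self.experts.parameters():  # pre-trained experts stay frozen
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor, route: torch.Tensor) -> torch.Tensor:
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., E)
        return (outs * route).sum(dim=-1)


# Toy usage: pool a task embedding from support features, then route queries.
dim, num_experts = 64, 4
support_feats = torch.randn(5, dim)      # features of 5 support samples
task_emb = support_feats.mean(dim=0)     # simple average-pooled task embedding
router = MetaRouter(dim, num_experts)    # only these weights are fine-tuned
block = MoEBlock(dim, num_experts)
query = torch.randn(2, dim)
out = block(query, router(task_emb))     # task-adapted computation pathway
```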
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This paper explores applying the pre-trained knowledge of the multi-modal generative foundation model Stable Diffusion to a downstream task known as universal few-shot dense prediction, which involves training a model to perform arbitrary dense prediction tasks with limited labeled images. Although diffusion models are primarily trained with a generative loss, their features have demonstrated remarkable performance on specific visual perception tasks, particularly dense prediction tasks that require a comprehensive understanding of pixel-level fine-grained information. However, existing works either focus on utilizing diffusion models for a single specific task or address multiple visual perception tasks without considering the few-shot setting. Consequently, there is currently no effective and elegant method to fully exploit the pre-trained knowledge of diffusion models and adapt them to various few-shot visual perception tasks in a universal manner. To the best of our knowledge, our work is the first attempt to bridge this gap and address this challenge.
Supplementary Material: zip
Submission Number: 556