TL;DR: Multi-task Learning in Diffusion Models
Abstract: Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adapt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate new tasks, which makes these models inefficient for on-device deployment. We propose *Multi-Task Upcycling* (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as *experts*, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with single-task fine-tuned diffusion models across several tasks including *image editing, super-resolution*, and *inpainting*, while maintaining latency and computational load (GFLOPs) similar to those of the single-task fine-tuned models.
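To make the described mechanism concrete, below is a minimal sketch of what an upcycled FFN block with smaller expert FFNs and dynamic routing could look like. It is not the authors' implementation; the class name `MTUExpertFFN`, the use of a task embedding as the routing signal, the softmax weighting, and all dimensions are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class MTUExpertFFN(nn.Module):
    """Illustrative sketch only: a large FFN is replaced by several smaller
    expert FFNs whose outputs are mixed by a learned router. The routing
    signal (a task embedding) and the mixing rule are assumptions."""

    def __init__(self, dim: int, expert_hidden: int, num_experts: int, task_embed_dim: int):
        super().__init__()
        # Each expert is a smaller FFN than the original dense layer it replaces.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, expert_hidden),
                nn.GELU(),
                nn.Linear(expert_hidden, dim),
            )
            for _ in range(num_experts)
        ])
        # Router maps a task embedding to per-expert mixing weights.
        self.router = nn.Linear(task_embed_dim, num_experts)

    def forward(self, x: torch.Tensor, task_embed: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); task_embed: (batch, task_embed_dim)
        weights = torch.softmax(self.router(task_embed), dim=-1)        # (batch, experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, experts, tokens, dim)
        # Weighted combination of expert outputs per sample.
        return torch.einsum("be,betd->btd", weights, expert_outs)
```

Under these assumptions, the block keeps the FFN's input/output interface, so it could drop into an existing transformer-style diffusion backbone while the router decides, per task, how much each expert contributes.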
Lay Summary: Imagine having a powerful, personalised AI assistant right on your phone—one that not only chats with you but can also edit, enhance, or fill in parts of your photos. The problem is, today’s image generation models are huge and too heavy to run on a phone, which has limited computing power. Shrinking these models without losing their abilities is a tough challenge. So, we asked: how can we take a compact, on-device model and train it to handle multiple image generation tasks? Inspired by how the brain assigns jobs to different areas, we explored pre-trained image models and found a key component responsible for solving image generation tasks. We split this part into smaller, specialised units called "experts." Each expert focuses on certain types of image tasks, and depending on what the model is doing, the most relevant experts get more say in the final result. With this setup, we turned a single lightweight model into a flexible multitasker—capable of doing four different image generation jobs—all while staying efficient enough to run on your phone.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: multi-task learning, upcycling, diffusion models
Submission Number: 11801