MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

Sen Xing; Muyan Zhong; Zeqiang Lai; Liangchen Li; Jiawen Liu; Yaohui Wang; Jifeng Dai; Wenhai Wang

Published: 01 May 2025, Last Modified: 23 Jul 2025; Venue: ICML 2025 poster; License: CC BY 4.0

Abstract: In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency: using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance: it achieves comparable generation capabilities in over 110 languages, with CLIP similarity scores nearly matching those in English (39.57 for English vs. 39.61 for other languages); and (3) Broad applicability: it integrates seamlessly with compatible community tools such as LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
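
To make the "fewer than 20M parameters" budget concrete, the following is a minimal back-of-the-envelope sketch in plain Python. It is not the authors' implementation: the adapter shape (one standard transformer layer plus an output projection), the widths 1024 and 768, and the frozen-model sizes are all assumptions chosen for illustration, and the names `Component`, `linear_params`, and `transformer_layer_params` are hypothetical helpers. It only shows how a small trainable module can sit between two much larger frozen models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    num_params: int
    trainable: bool


def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return d_in * d_out + d_out


def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """Parameters of one standard transformer encoder layer."""
    attention = 4 * linear_params(d_model, d_model)  # Q, K, V, output
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, scale and shift each
    return attention + feed_forward + layer_norms


# Hypothetical adapter: one transformer layer plus an output projection.
ENCODER_WIDTH = 1024    # assumed multilingual text-encoder width
CONDITION_WIDTH = 768   # assumed diffusion-model conditioning width
adapter_params = (
    transformer_layer_params(ENCODER_WIDTH, 4 * ENCODER_WIDTH)
    + linear_params(ENCODER_WIDTH, CONDITION_WIDTH)
)

# The frozen-model sizes below are placeholders, not figures from the paper.
components = [
    Component("text_encoder (frozen)", 560_000_000, trainable=False),
    Component("language_adapter", adapter_params, trainable=True),
    Component("diffusion_model (frozen)", 860_000_000, trainable=False),
]

trainable = sum(c.num_params for c in components if c.trainable)
total = sum(c.num_params for c in components)

print(f"adapter parameters: {adapter_params:,}")
print(f"trainable share of the pipeline: {trainable / total:.2%}")
```

Under these assumed dimensions the adapter stays below the 20M budget, and only the adapter's parameters would receive gradient updates; the actual architecture and sizes are documented in the linked code repository.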

Lay Summary: Generating images from text descriptions is usually resource-intensive, particularly for languages other than English, because high-quality multilingual datasets are costly and limited. To address this, we developed MuLan, a cost-effective method for multilingual image generation. Instead of using expensive, high-quality datasets, we used widely available, noisy Internet data combined with multilingual text encoders. Our key innovation is a lightweight language adapter, called MuLan, which has fewer than 20 million parameters and works alongside existing text and image models. MuLan significantly reduces training costs while achieving high-quality image generation in more than 110 languages. Remarkably, the image quality for other languages closely matches the performance in English, making advanced text-to-image generation accessible globally. Furthermore, our approach easily integrates with popular community tools, enhancing its versatility and potential applications. This work democratizes access to powerful multilingual image generation technologies, allowing users around the world to create high-quality images efficiently and affordably.

Link To Code: https://github.com/mulanai/MuLan

Primary Area: Applications->Computer Vision

Keywords: Image Generation

Submission Number: 9499
