TL;DR: We introduce eMIGM, a powerful generative model that significantly accelerates sampling while maintaining high image quality.
Abstract: Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as \textbf{eMIGM}. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet $256\times256$, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFEs and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion model REPA while requiring fewer than 45\% of the NFEs. Additionally, on ImageNet $512\times512$, eMIGM outperforms the strong continuous diffusion model EDM2. Code is available at \url{https://github.com/ML-GSAI/eMIGM}.
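To make the unified view concrete, below is a minimal PyTorch-style sketch of a masked-token training step: a random masking ratio is drawn, the masked positions are predicted, and a cross-entropy loss is applied. The `model` interface, `MASK_ID`, and the 1/t reweighting (which corresponds to a masked-diffusion-style ELBO weight; dropping it recovers a plain masked-image-generation loss) are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch (not the paper's exact objective) of the shared training step:
# sample a masking ratio, hide tokens, and predict them with cross-entropy.
# `model`, MASK_ID, and the 1/t reweighting are illustrative assumptions.
import torch
import torch.nn.functional as F

MASK_ID = 1024  # assumed id of the special [MASK] token (outside the codebook)

def masked_token_loss(model, tokens):
    """tokens: (B, L) discrete image-token ids from a VQ tokenizer."""
    B, L = tokens.shape
    t = torch.rand(B, device=tokens.device).clamp(min=1e-3)      # per-sample masking ratio
    mask = torch.rand(B, L, device=tokens.device) < t[:, None]   # positions to hide
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)                                    # (B, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")  # (B, L)
    # The 1/t weight mirrors the masked-diffusion ELBO; without it, this is
    # the plain masked-image-generation objective.
    per_sample = (ce * mask).sum(-1) / t / L
    return per_sample.mean()
```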
Lay Summary: Our new AI, eMIGM, creates realistic pictures by learning to fill in missing parts of images, much like solving a puzzle. We discovered that by combining the strengths of two existing image generation methods into a unified system, and carefully refining how it learns and creates, we could significantly boost both image quality and generation speed. eMIGM learns more effectively when more of an image is hidden during its training. When generating new pictures, it cleverly predicts fewer details initially and only receives stronger guidance later on, making it much faster. As a result, eMIGM produces high-quality images, outperforming standard methods and matching top-tier ones with significantly less computational power, even on large images. This work makes high-quality AI image generation more effective and efficient, and our code is publicly available.
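As an illustration of the sampling idea described above (few tokens revealed early, stronger guidance later), the following hedged sketch uses a cosine unmasking schedule and a linearly ramped classifier-free guidance scale. The model signature, schedules, and constants are assumptions for exposition and may differ from the released implementation.

```python
# Illustrative sampler sketch: a cosine schedule reveals few tokens early,
# and classifier-free guidance is ramped up only in later steps.
# The model signature, schedules, and constants are assumptions.
import math
import torch

@torch.no_grad()
def sample(model, labels, num_steps=16, seq_len=256, mask_id=1024, cfg_max=3.0):
    device = labels.device
    tokens = torch.full((labels.shape[0], seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        # Cosine schedule: fraction of tokens that should remain masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / num_steps)
        target_revealed = seq_len - int(frac_masked * seq_len)
        already_revealed = int((tokens[0] != mask_id).sum())
        n_reveal = target_revealed - already_revealed
        if n_reveal <= 0:
            continue
        # Guidance strength grows over the trajectory: weak early, strong late.
        cfg = cfg_max * (step + 1) / num_steps
        cond, uncond = model(tokens, labels), model(tokens, None)
        probs = (uncond + cfg * (cond - uncond)).softmax(-1)
        pred = torch.multinomial(probs.flatten(0, 1), 1).view(tokens.shape)
        conf = probs.gather(-1, pred[..., None]).squeeze(-1)
        conf = conf.masked_fill(tokens != mask_id, -1.0)   # never overwrite revealed tokens
        idx = conf.topk(n_reveal, dim=-1).indices          # most confident masked positions
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```

Under this kind of schedule, most tokens are only decided in the later steps, which is also where the guidance scale is largest; this mirrors the "fewer details initially, stronger guidance later" description above.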
Link To Code: https://github.com/ML-GSAI/eMIGM
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: masked diffusion model, mask image modeling, generative model, efficient, effective
Submission Number: 6388