From Pixels to Spectra: Efficient Generative Modeling via Frequency Masking

18 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Frequency space, generative models
Abstract: Diffusion models have recently achieved state-of-the-art results in image generation, but most high-performing methods operate in a compressed latent space, limiting fidelity to the reconstruction quality of the encoder–decoder. Ambient-space approaches preserve full information but incur substantial computational cost. In this work, we propose \emph{Masked Frequency Flow Matching}, a scalable and efficient diffusion framework that operates directly in the discrete cosine transform (DCT) domain. Our method leverages a compact frequency representation that enables multi-resolution training without the high channel count of prior DCT-based approaches. We introduce a frequency–spatial masking strategy and a masked diffusion transformer architecture tailored to this domain, substantially reducing FLOPs while maintaining high sample quality. On ImageNet at $256^2$ resolution, our approach outperforms prior ambient-space models in FID and efficiency, and scales effectively to higher resolutions. Extensive ablations confirm the importance of our masking mechanism and architecture design, establishing \emph{Masked Frequency Flow Matching} as a competitive alternative to both latent- and pixel-space diffusion models.
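The abstract's core idea, operating on a compact DCT representation and masking frequency coefficients, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `frequency_mask` and the `keep_fraction` knob are hypothetical, and the sketch only shows a simple low-pass mask in the 2D DCT domain, not the paper's learned frequency–spatial masking or flow-matching objective.

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_mask(image, keep_fraction=0.25):
    """Illustrative sketch (not the paper's method): map an image to the
    DCT domain, zero out high-frequency coefficients, and reconstruct.
    keep_fraction is a hypothetical knob controlling how much of the
    low-frequency corner of the coefficient grid is retained."""
    coeffs = dctn(image, norm="ortho")  # 2D DCT-II of the image
    h, w = coeffs.shape
    mask = np.zeros_like(coeffs)
    # Low frequencies live in the top-left corner of the DCT grid.
    mask[: int(h * keep_fraction), : int(w * keep_fraction)] = 1.0
    return idctn(coeffs * mask, norm="ortho")  # low-pass reconstruction

# Usage: masking discards fine detail but keeps coarse structure;
# with keep_fraction=1.0 the orthonormal DCT round-trips exactly.
img = np.random.rand(64, 64)
recon = frequency_mask(img, keep_fraction=0.5)
print(recon.shape)  # (64, 64)
```

Because the orthonormal DCT is invertible, dropping coefficients is the only source of information loss, which is what makes the frequency domain attractive as a compact, multi-resolution representation compared to a lossy encoder–decoder.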
Primary Area: generative models
Code Of Ethics: true
Submission Guidelines: true
Anonymous Url: true
No Acknowledgement Section: true
Submission Number: 14581