From Pixels to Spectra: Efficient Generative Modeling via Frequency Masking

18 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Frequency space, generative models
Abstract: Diffusion models have recently achieved state-of-the-art results in image generation, but most high-performing methods operate in a compressed latent space, limiting fidelity to the reconstruction quality of the encoder–decoder. Ambient-space approaches preserve full information but incur substantial computational cost. In this work, we propose \emph{Masked Frequency Flow Matching}, a scalable and efficient diffusion framework that operates directly in the discrete cosine transform (DCT) domain. Our method leverages a compact frequency representation that enables multi-resolution training without the high channel count of prior DCT-based approaches. We introduce a frequency–spatial masking strategy and a masked diffusion transformer architecture tailored to this domain, substantially reducing FLOPs while maintaining high sample quality. On ImageNet at $256^2$ resolution, our approach outperforms prior ambient-space models in FID and efficiency, and scales effectively to higher resolutions. Extensive ablations confirm the importance of our masking mechanism and architecture design, establishing \emph{Masked Frequency Flow Matching} as a competitive alternative to both latent- and pixel-space diffusion models.
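The abstract's core idea, operating on a compact DCT representation and masking frequency coefficients, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `frequency_mask` and the `keep_fraction` knob are hypothetical, and the sketch only shows a simple low-pass mask in the 2D DCT domain, not the paper's learned frequency–spatial masking or flow-matching objective.

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_mask(image, keep_fraction=0.25):
    """Illustrative sketch (not the paper's method): map an image to the
    DCT domain, zero out high-frequency coefficients, and reconstruct.
    keep_fraction is a hypothetical knob controlling how much of the
    low-frequency corner of the coefficient grid is retained."""
    coeffs = dctn(image, norm="ortho")  # 2D DCT-II of the image
    h, w = coeffs.shape
    mask = np.zeros_like(coeffs)
    # Low frequencies live in the top-left corner of the DCT grid.
    mask[: int(h * keep_fraction), : int(w * keep_fraction)] = 1.0
    return idctn(coeffs * mask, norm="ortho")  # low-pass reconstruction

# Usage: masking discards fine detail but keeps coarse structure;
# with keep_fraction=1.0 the orthonormal DCT round-trips exactly.
img = np.random.rand(64, 64)
recon = frequency_mask(img, keep_fraction=0.5)
print(recon.shape)  # (64, 64)
```

Because the orthonormal DCT is invertible, dropping coefficients is the only source of information loss, which is what makes the frequency domain attractive as a compact, multi-resolution representation compared to a lossy encoder–decoder.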
Primary Area: generative models
Code Of Ethics: true
Submission Guidelines: true
Anonymous Url: true
No Acknowledgement Section: true
Submission Number: 14581