Sparse Image Synthesis via Joint Latent and RoI Flow

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: diffusion models, flow matching, sparse diffusion, sparse generation
Abstract: Natural images often exhibit underlying sparse structure, with information density varying significantly across spatial locations. However, most generative models rely on dense grid-based pixels or latents, neglecting this inherent sparsity. In this paper, we explore a visual generation paradigm built on sparse, non-grid latent representations. Specifically, we design a sparse autoencoder that represents an image as a small number of latents together with their positional properties (i.e., regions of interest, RoIs) at high reconstruction quality. We then train flow-matching transformers jointly on the non-grid latents and RoI values. To the best of our knowledge, we are the first to address spatial sparsity via RoIs in the generative process. Experimental results show that our sparse flow-based transformers are competitive with dense grid-based counterparts at significantly lower compute, reaching a 2.76 FID with just 64 latents on class-conditional ImageNet $256\times 256$ generation.
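The sketch below is not the authors' implementation; it only illustrates, under assumptions, what "joint flow matching over non-grid latents and RoI values" could look like in PyTorch: each image is a set of tokens formed by concatenating a latent vector with a 4-dimensional RoI, and a velocity network is trained with a rectified-flow-style objective. The class `FlowTransformer`, the token and RoI dimensions, and the linear interpolation path are all illustrative choices, not details taken from the paper.

```python
# Hypothetical sketch of joint flow matching on sparse latents + RoIs.
# All names, shapes, and the MLP velocity network are illustrative assumptions.
import torch
import torch.nn as nn


class FlowTransformer(nn.Module):
    """Placeholder velocity-field network over concatenated [latent, RoI] tokens."""

    def __init__(self, latent_dim: int = 16, roi_dim: int = 4, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + roi_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, latent_dim + roi_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (B, N, latent_dim + roi_dim); t: (B,) flow time in [0, 1]
        t_feat = t[:, None, None].expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def joint_flow_matching_loss(model: nn.Module, z1: torch.Tensor, roi1: torch.Tensor) -> torch.Tensor:
    """Regress the straight-line velocity from noise to data, treating latents
    and RoI values as one token sequence (linear interpolation path)."""
    x1 = torch.cat([z1, roi1], dim=-1)            # data endpoint, (B, N, D+4)
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    v_target = x1 - x0                            # constant velocity of the linear path
    v_pred = model(x_t, t)
    return ((v_pred - v_target) ** 2).mean()


# Toy usage: 64 sparse latents per image, 16-dim each, with 4-dim RoIs.
model = FlowTransformer()
z1 = torch.randn(8, 64, 16)
roi1 = torch.rand(8, 64, 4)
loss = joint_flow_matching_loss(model, z1, roi1)
loss.backward()
```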
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 15869