Keywords: image inpainting, diffusion models, transformers
Abstract: We introduce Saint, a framework for image inpainting with large-scale diffusion and flow-based transformers in a latent multi-variable setup. Existing methods for latent image inpainting rely on RePaint-like sampling or mask concatenation, which either fails to exploit the masked image as strong conditioning or neglects the fact that the denoising model has already been trained for masking via noising. In contrast, Saint fine-tunes pre-trained Diffusion Transformers (DiTs) as Spatial Reasoning Models (SRMs) with varying noise levels across masked and unmasked regions, allowing the model to be conditioned directly via the partially noised latent. This more effective conditioning scheme improves inpainting performance on binary masks and further extends to continuous masks. Moreover, the multi-variable formulation of SRMs enables a Spatial Classifier-Free Guidance strategy tailored for inpainting, as well as a token-caching scheme for efficient local edits. We evaluate Saint on the ImageNet1k and JourneyDB datasets across a variety of inpainting scenarios and show that it consistently improves on the state of the art in generative and reconstruction metrics. Our codebase and models will be released publicly upon acceptance.
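To make the core conditioning idea concrete, the following is a minimal, hypothetical sketch of noising the latent with per-region noise levels, as the abstract describes (high noise inside the mask, low or zero noise on the known region). The function name, parameter names, and the linear flow-matching-style interpolation x_t = (1 - t) x_0 + t eps are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def partially_noised_latent(z_clean, mask, t_masked, t_known):
    """Illustrative sketch: noise the masked region of a clean latent to a
    high noise level t_masked while the known region keeps a low noise level
    t_known, producing the partially noised latent that conditions the DiT
    (instead of RePaint-like resampling or mask concatenation).

    z_clean: (B, C, H, W) clean image latent
    mask:    (B, 1, H, W) in [0, 1]; 1 = region to inpaint (binary or continuous)
    t_*:     scalars in [0, 1], noise levels under an assumed linear
             interpolation x_t = (1 - t) * x_0 + t * eps
    """
    eps = torch.randn_like(z_clean)
    # Per-location noise level: high inside the mask, low (or zero) outside.
    t = mask * t_masked + (1.0 - mask) * t_known
    return (1.0 - t) * z_clean + t * eps

# Usage: the known pixels stay (almost) clean and act as strong conditioning.
z = torch.randn(1, 4, 32, 32)                      # placeholder VAE latent
m = torch.zeros(1, 1, 32, 32); m[..., 8:24, 8:24] = 1.0
z_t = partially_noised_latent(z, m, t_masked=0.9, t_known=0.0)
```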
Primary Area: generative models
Submission Number: 8028