Keywords: image inpainting, diffusion models, transformers
Abstract: We introduce Saint, a framework for image inpainting with large-scale diffusion and flow-based transformers in a latent multi-variable setup. Existing methods for latent image inpainting rely on RePaint-like sampling or mask concatenation, which either fails to exploit the masked image as strong conditioning or neglects the fact that the denoising model has already been trained for masking via noising. In contrast, Saint fine-tunes pre-trained Diffusion Transformers (DiTs) as Spatial Reasoning Models (SRMs) with varying noise levels across masked and unmasked regions, allowing the model to be conditioned directly via the partially noised latent. This more effective conditioning scheme improves inpainting performance on binary masks and further extends to continuous masks. Moreover, the multi-variable formulation of SRMs enables a Spatial Classifier-Free Guidance strategy tailored for inpainting, as well as a token-caching scheme for efficient local edits. We evaluate Saint on the ImageNet1k and JourneyDB datasets across a variety of inpainting scenarios and show that it consistently improves on the state of the art in generative and reconstruction metrics. Our codebase and models will be released publicly upon acceptance.
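To make the core conditioning idea concrete, the following is a minimal, hypothetical sketch of noising the latent with per-region noise levels, as the abstract describes (high noise inside the mask, low or zero noise on the known region). The function name, parameter names, and the linear flow-matching-style interpolation x_t = (1 - t) x_0 + t eps are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def partially_noised_latent(z_clean, mask, t_masked, t_known):
    """Illustrative sketch: noise the masked region of a clean latent to a
    high noise level t_masked while the known region keeps a low noise level
    t_known, producing the partially noised latent that conditions the DiT
    (instead of RePaint-like resampling or mask concatenation).

    z_clean: (B, C, H, W) clean image latent
    mask:    (B, 1, H, W) in [0, 1]; 1 = region to inpaint (binary or continuous)
    t_*:     scalars in [0, 1], noise levels under an assumed linear
             interpolation x_t = (1 - t) * x_0 + t * eps
    """
    eps = torch.randn_like(z_clean)
    # Per-location noise level: high inside the mask, low (or zero) outside.
    t = mask * t_masked + (1.0 - mask) * t_known
    return (1.0 - t) * z_clean + t * eps

# Usage: the known pixels stay (almost) clean and act as strong conditioning.
z = torch.randn(1, 4, 32, 32)                      # placeholder VAE latent
m = torch.zeros(1, 1, 32, 32); m[..., 8:24, 8:24] = 1.0
z_t = partially_noised_latent(z, m, t_masked=0.9, t_known=0.0)
```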
Primary Area: generative models
Submission Number: 8028