Unseen Image Synthesis with Diffusion Models (a.k.a. UnseenDiffusion)
Anonymous Submission ID 299
In our work, given a DDPM trained on dog faces, we are able to generate images from dramatically different image domains without changing any parameters of this base model.

1. High-Level Insights and Take-Away
As a high-level take-away, we seek to provide insights into several open questions about generative modeling with diffusion models.
1. What is the limit of the representation ability of DDPMs?
2. Where does the generalization ability of DDPMs come from?
3. What role does the stochastic Gaussian noise play in diffusion models?
For the first question, we show via our arbitrary image reconstruction tests that a DDPM pre-trained on single-domain images already has sufficient representation ability to reconstruct arbitrary unseen images well by following deterministic inversion and denoising trajectories. In other words, although the inverted latent encodings may not be optimal in terms of sampling and denoising, we can indeed trace the trajectories in both directions.
For the second question, recall that the generative process of an unconditional DDPM (after training has been completed) consists of essentially two steps: latent sampling and denoising. The denoising process can be considered a mapping trajectory between the latent space and the real data space. In the extreme case, one can always find latent encodings that correspond to real images given a relatively fixed denoising trajectory; therefore, the biggest challenge when synthesizing new data lies in the first step, latent sampling.
To some degree, synthesizing new images, whether from in-distribution (ID) or out-of-distribution (OOD) image domains, seems not to be a "creation" process but rather a "discovery" process in the latent space. This is intrinsically different from the "mode collapse" issue in many GAN-based works, which describes a model-dependent issue where the mapping trajectories collapse to similar ending points in the image space.
As for the third question, we notice that some recent works touch on it from different angles. For instance, ColdDiffusion [1] empirically indicates that stochastic Gaussian noise may not be necessary for diffusion models to generate new data. BoundaryDiffusion [2] reveals a "distance effect" that exists only in deterministic formulations and leads to distorted images.
A potential unified answer to this open question, based on our understanding, is that stochastic noise may serve as a mitigation that relaxes the trade-off between sampling and denoising.
2. Representation Ability
This work is based on our key observation that a DDPM pre-trained even on a single image domain already has sufficient representation ability to accurately reconstruct arbitrary unseen images from the inverted latent encodings by following a deterministic denoising trajectory [3], as shown in the following examples with iDDPM [4] trained on dog faces as the base model.
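The deterministic round trip behind this observation can be sketched in a toy, self-contained form. The DDIM-style update below follows the deterministic formulation of [3], but the epsilon predictor is a hypothetical linear stand-in for the trained iDDPM, so this is an illustration of the inversion/denoising mechanics rather than the actual model:

```python
import numpy as np

# Toy sketch of deterministic (DDIM-style) inversion and denoising.
# "eps_model" is a placeholder deterministic predictor, NOT a trained network.
T = 100
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t for t = 0..T-1

def eps_model(x, t):
    # Hypothetical stand-in for the trained noise predictor.
    return 0.1 * x

def ddim_step(x, ab_from, ab_to, eps):
    # Deterministic DDIM update between two noise levels (eta = 0).
    pred_x0 = (x - np.sqrt(1 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * pred_x0 + np.sqrt(1 - ab_to) * eps

def invert(x0):
    # Image -> latent: trace the deterministic trajectory forward in t.
    x = x0
    for t in range(T - 1):
        x = ddim_step(x, alpha_bar[t], alpha_bar[t + 1], eps_model(x, t))
    return x

def denoise(xT):
    # Latent -> image: trace the same trajectory backward in t.
    x = xT
    for t in reversed(range(T - 1)):
        x = ddim_step(x, alpha_bar[t + 1], alpha_bar[t], eps_model(x, t + 1))
    return x

x0 = np.random.default_rng(0).standard_normal(16)  # stand-in "unseen image"
x_rec = denoise(invert(x0))
print(np.max(np.abs(x_rec - x0)))  # small discretization error only
```

Because both directions are deterministic, the reconstruction error comes only from evaluating the predictor at slightly different states in the two passes, which is the sense in which the trajectories can be traced in both directions.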

We hereby introduce the concept of bandwidth, which characterizes the tolerance of the given diffusion model to the degree of stochasticity for a target unseen domain. The bandwidth is an important property and parameter for the unseen image synthesis task, with a detailed discussion presented in our paper.
3. Latent Sampling with Geometric Optimization
3.1 Latent Distribution Estimation
We revisit the inversion technique and its underlying theoretical support from [3], and note that the actual diffusion (inversion) process does not depend on the model and, in theory, establishes Gaussians in the intermediate latent spaces. Therefore, we propose to use unseen images to estimate these intermediate Gaussians.
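A minimal sketch of this estimation step, assuming we already hold deterministically inverted latents for a small set of unseen images (simulated here as random vectors; the 25 images and 64 dimensions are arbitrary toy choices, not the paper's settings):

```python
import numpy as np

# Fit a per-dimension Gaussian to inverted unseen latents, then sample
# new latent encodings from the estimated intermediate distribution.
rng = np.random.default_rng(0)
d = 64                                                     # toy latent size
unseen_latents = 1.5 * rng.standard_normal((25, d)) + 0.3  # simulated z_i

mu = unseen_latents.mean(axis=0)     # estimated Gaussian mean
sigma = unseen_latents.std(axis=0)   # estimated Gaussian std (diagonal)

def sample_latent(n):
    # Draw n candidate latents from the estimated intermediate Gaussian.
    return mu + sigma * rng.standard_normal((n, d))

candidates = sample_latent(8)        # candidates for the denoising stage
print(candidates.shape)              # (8, 64)
```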
3.2 Geometric Optimization
However, we acknowledge that a Gaussian prior alone is insufficient for achieving unseen image synthesis in practice, for several reasons. First, there is always a gap between the theory and actual model training: even for inverted ID latent encodings, the Gaussian assumption is not always correct [2]. Second, the sampled latent encodings are easily captured by the ID trajectories.
To this end, we draw inspiration from the recent BoundaryDiffusion work [2] and leverage the geometric properties of the high-dimensional latent spaces as additional domain-specific and model-dependent information to optimize the latent sampling. Specifically, we consistently observe several unique geometric properties in the latent spaces that can be used as optimization constraints to reject unqualified sampled latent encodings.
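One concrete constraint of this kind, used here purely as an illustrative stand-in for the geometric properties observed in the paper, is the standard high-dimensional fact that Gaussian latents concentrate near a hypersphere of radius about sqrt(d): candidates far from the empirical shell can be rejected. The tolerance below is a hypothetical knob, not the paper's exact constraint:

```python
import numpy as np

# Geometric rejection sketch: keep only candidates whose norm lies near
# the shell radius estimated from (simulated) inverted unseen latents.
rng = np.random.default_rng(1)
d = 4096
ref_latents = rng.standard_normal((25, d))           # stand-in inverted latents
radius = np.linalg.norm(ref_latents, axis=1).mean()  # empirical shell radius

def accept(z, tol=0.05):
    # Accept a candidate only if its norm is within tol of the shell radius.
    return abs(np.linalg.norm(z) - radius) / radius < tol

good = rng.standard_normal(d)       # on-shell candidate, ||z|| ~ sqrt(d)
bad = 2.0 * rng.standard_normal(d)  # off-shell candidate, ||z|| ~ 2*sqrt(d)
print(accept(good), accept(bad))
```

In high dimension the norm of a standard Gaussian fluctuates by only O(1) around sqrt(d), so a tight relative tolerance rejects off-shell samples while keeping almost all well-placed ones.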

3.3 UnseenDiffusion Method
To sum up, we propose our UnseenDiffusion method, a training-free approach to the unseen image synthesis task, which consists of three components: latent distribution estimation, latent geometric optimization, and relatively deterministic denoising.

4. Experiments
We conduct extensive experiments on the CelebA-HQ, LSUN-Church, LSUN-Bedroom, and AFHQ-Dog datasets using different model architectures (DDPMs [5], improved DDPMs [4]), achieving the objective of synthesizing unseen images.
Notably, we explicitly clarify that the common "mode collapse" issue in generative models does not exist in this work, with detailed justifications given in our paper and appendices.



References
[1] Bansal, Arpit, Eitan Borgnia, Hong-Min Chu, Jie S. Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. "Cold diffusion: Inverting arbitrary image transforms without noise." arXiv preprint arXiv:2208.09392 (2022).
[2] Zhu, Ye, Yu Wu, Zhiwei Deng, Olga Russakovsky, and Yan Yan. "Boundary guided mixing trajectory for semantic control with diffusion models." In NeurIPS 2023.
[3] Song, Jiaming, Chenlin Meng, and Stefano Ermon. "Denoising diffusion implicit models." In ICLR 2021.
[4] Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models." In ICML 2021.
[5] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." In NeurIPS 2020.