Novel View Synthesis with Diffusion Models

3D generation from a single image

Anonymous ICLR 2023 Authors

We present 3DiM (pronounced "three-dim"), a diffusion model for 3D novel view synthesis from as few as a single image. The core of 3DiM is an image-to-image diffusion model -- 3DiM takes a single reference view and a relative pose as input, and generates a novel view via diffusion. 3DiM can then generate a full 3D consistent scene following our novel stochastic conditioning sampler. The output frames of the scene are generated autoregressively. During the reverse diffusion process of each individual frame, we select a random conditioning frame from the set of previous frames at each denoising step. We demonstrate that stochastic conditioning yields much more 3D consistent results compared to the naïve sampling process which only conditions on a single previous frame. We compare 3DiMs to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM's generated videos from a single view achieve much higher fidelity while being approximately 3D consistent. We also introduce a new evaluation methodology, 3D consistency scoring, to measure the 3D consistency of a generated object by training a neural field on the model's output views. 3DiMs are geometry free, do not rely on hyper-networks or test-time optimization for novel view synthesis, and allow a single model to easily scale to a large number of scenes.


3DiM is an AI system that creates 3D renderings from a single input image.

Generation with 3DiM -- We propose stochastic conditioning, a new sampling strategy in which views are generated autoregressively with an image-to-image diffusion model. At each denoising step, we condition on a randomly chosen previous view, so that, given enough denoising steps, the denoising process is guided toward 3D consistency with all previous frames.
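The sampling loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `denoise_step` is a hypothetical stand-in for one reverse-diffusion step of the image-to-image model (a real model would run a neural network and use the poses), and the names `sample_views`, `views`, and `poses` are invented for this example.

```python
import numpy as np

def denoise_step(z, t, cond_image, cond_pose, target_pose):
    """Hypothetical stand-in for one reverse-diffusion step.

    A real model would predict noise from (z, t, cond_image, poses).
    This toy version ignores the poses and simply pulls z toward the
    conditioning image, reaching it exactly at t == 0.
    """
    return z + (cond_image - z) / (t + 1)

def sample_views(input_image, input_pose, target_poses, num_steps=256, rng=None):
    """Generate views autoregressively with stochastic conditioning."""
    rng = np.random.default_rng(0) if rng is None else rng
    views = [input_image]   # all frames available for conditioning so far
    poses = [input_pose]
    for target_pose in target_poses:
        z = rng.standard_normal(input_image.shape)  # start from pure noise
        for t in reversed(range(num_steps)):
            # Key idea: pick a *random* previous frame at every denoising step,
            # rather than conditioning on one fixed frame for the whole chain.
            k = rng.integers(len(views))
            z = denoise_step(z, t, views[k], poses[k], target_pose)
        views.append(z)          # the new frame joins the conditioning set
        poses.append(target_pose)
    return views[1:]             # generated frames only
```

The naive sampler mentioned in the abstract would correspond to fixing `k` to the most recent frame instead of redrawing it at each step.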

Results on diverse data


We show selected samples from a single 3DiM trained on all of ShapeNet. We rendered 250 views for each asset with Kubric and trained a 471M-parameter 3DiM. Videos are sampled from a single input image with 256 denoising steps, i.e., 512 model forward passes once classifier-free guidance is taken into account.
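The count of 512 forward passes follows from evaluating the model twice per denoising step, once with and once without conditioning. The sketch below illustrates this with the commonly used classifier-free guidance combination `(1 + w) * eps_cond - w * eps_uncond`; whether 3DiM uses exactly this parameterization, and the guidance weight `w`, are assumptions here. `CountingModel` is a hypothetical dummy network used only to count calls.

```python
import numpy as np

class CountingModel:
    """Hypothetical dummy noise predictor that counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, z, t, cond):
        self.calls += 1
        bias = 0.0 if cond is None else 0.1  # toy dependence on conditioning
        return 0.5 * z + bias

def guided_eps(model, z, t, cond, w):
    """Classifier-free guidance: two forward passes per denoising step."""
    eps_cond = model(z, t, cond)     # conditional prediction
    eps_uncond = model(z, t, None)   # unconditional prediction
    return (1.0 + w) * eps_cond - w * eps_uncond

def count_forward_passes(num_steps, w=3.0):
    model = CountingModel()
    z = np.zeros((4, 4, 3))
    for t in reversed(range(num_steps)):
        eps = guided_eps(model, z, t, cond="view", w=w)
        z = z - 0.01 * eps  # toy update, not a real diffusion sampler
    return model.calls

print(count_forward_passes(256))  # 2 passes per step over 256 steps
```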

Pose Conditioning × Image-to-Image Diffusion


By allowing the core of 3DiM to remain an image-to-image model, we can bypass the difficulties of designing and training architectures that jointly model multiple frames. More importantly, we enable training with datasets that have as few as two views per scene.
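Because the model only ever sees a pair of views, a training example can be drawn from any scene that has at least two posed images. A minimal sketch of such pair sampling is given below. It assumes poses are 4x4 camera-to-world matrices and that the relative pose is `inv(pose_ref) @ pose_tgt`; both the convention and the function names are assumptions for illustration, not details taken from the source.

```python
import numpy as np

def relative_pose(pose_ref, pose_tgt):
    """Pose of the target camera expressed in the reference camera's frame.

    Assumes 4x4 camera-to-world matrices (an assumption of this sketch).
    """
    return np.linalg.inv(pose_ref) @ pose_tgt

def sample_training_pair(images, poses, rng):
    """Draw two distinct views of one scene; works with as few as two views."""
    i, j = rng.choice(len(images), size=2, replace=False)
    return images[i], images[j], relative_pose(poses[i], poses[j])
```

During training, the first image would serve as the conditioning view, the second as the denoising target, and the relative pose as the extra conditioning signal.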

3DiM research highlights


X-UNet -- Our proposed changes to the image-to-image UNet, which we show are critical for achieving high-quality results.

Comparisons to Prior Work


We compare against prior state-of-the-art methods for novel view synthesis from few images on the SRN ShapeNet benchmark. Unlike 3DiM, all of the methods whose outputs we could acquire guarantee 3D consistency, because they use volume rendering. We render the same trajectories given the same conditioning image.

Columns, left to right: Input View | SRN | PixelNeRF | VisionNeRF | 3DiM (ours) | Ground Truth

State-of-the-art FID scores on SRN ShapeNet


Prior methods directly regress outputs, often leading to severe blurriness. We show that 3DiM overcomes this problem: it is a generative model by design, and diffusion models have a natural inductive bias towards generating much sharper samples. Below we show more samples from the 3DiMs we trained for comparison with prior work: a 471M-parameter 3DiM for cars and a 1.3B-parameter 3DiM for chairs.