AutoDecoding Latent 3D Diffusion Models

Supplementary Material


Unconditional Generation on Objaverse:

We train an unconditional 3D Diffusion model on the Latent Features of a 3D AutoDecoder trained on Objaverse. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. We produce and show renders from multiple views.
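For concreteness, the following minimal sketch shows this unconditional sampling loop in PyTorch-style code. The object names (latent_diffusion, autodecoder), the p_sample and decode methods, and the latent tensor shape are illustrative assumptions, not our exact implementation.

    import torch

    @torch.no_grad()
    def sample_unconditional(latent_diffusion, autodecoder, num_steps=256,
                             latent_shape=(1, 64, 8, 8, 8), device="cuda"):
        # Start from Gaussian noise in the autodecoder's 3D latent space.
        z = torch.randn(latent_shape, device=device)

        # Reverse diffusion: iterate from the final timestep back to t = 0.
        for t in reversed(range(num_steps)):
            t_batch = torch.full((latent_shape[0],), t, device=device, dtype=torch.long)
            z = latent_diffusion.p_sample(z, t_batch)  # one denoising step (assumed interface)

        # Upsample the denoised latent volume into a 64x64x64 RGB-D grid.
        rgbd_volume = autodecoder.decode(z)  # e.g. (1, 4, 64, 64, 64)
        return rgbd_volume

The renders shown in the video are produced by rendering this RGB-D grid from multiple camera viewpoints.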


Direct Latent Sampling Generation on Objaverse:

We sample a random vector in the latent space of a 3D AutoDecoder trained on Objaverse, as proposed in Unsupervised Volumetric Animation. Then, we decode it into a 64x64x64 RGB-D voxel grid. We produce and show renders from multiple views.
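The sketch below illustrates this baseline under similarly assumed interfaces; latent_dim and autodecoder.decode are placeholders for illustration only.

    import torch

    @torch.no_grad()
    def sample_direct(autodecoder, latent_dim=1024, device="cuda"):
        # Draw a random latent code directly from a Gaussian, skipping diffusion entirely,
        # following the sampling strategy of Unsupervised Volumetric Animation.
        z = torch.randn(1, latent_dim, device=device)

        # Decode the vector into a 64x64x64 RGB-D voxel grid.
        rgbd_volume = autodecoder.decode(z)
        return rgbd_volume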


Text-Driven Generation on Objaverse:

We train a text-conditioned 3D Diffusion model on the Latent Features of a 3D AutoDecoder trained on Objaverse. Captions were extracted using MiniGPT4. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. During diffusion, we apply classifier-free guidance with a weight of 3. We produce and show renders from multiple views.
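For reference, the snippet below sketches a single denoising step with classifier-free guidance at weight 3. The names eps_model, text_emb, and null_emb (the noise predictor and the conditional / empty-caption embeddings) are assumed for illustration, not our actual interfaces.

    import torch

    @torch.no_grad()
    def guided_noise_prediction(eps_model, z, t, text_emb, null_emb, guidance_weight=3.0):
        # Predict the noise with the text condition and with the empty (null) condition.
        eps_cond = eps_model(z, t, text_emb)
        eps_uncond = eps_model(z, t, null_emb)
        # Classifier-free guidance: extrapolate from the unconditional
        # prediction toward the conditional one.
        return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

This guided prediction replaces the plain conditional noise estimate at each of the 256 diffusion steps.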


Unconditional Generation on MVImgNet:

We train an unconditional 3D Diffusion model on the Latent Features of a 3D AutoDecoder trained on MVImgNet. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. We produce and show renders from multiple views.


Direct Latent Sampling Generation on MVImgNet:

We sample a random vector in the latent space of a 3D AutoDecoder trained on MVImgNet, as proposed in Unsupervised Volumetric Animation. Then, we decode it into a 64x64x64 RGB-D voxel grid. We produce and show renders from multiple views.


Text-Driven Generation on MVImgNet:

We train a text-conditioned 3D Diffusion model on the Latent Features of a 3D AutoDecoder trained on MVImgNet. Captions were extracted using MiniGPT4. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. During diffusion, we apply classifier-free guidance with a weight of 3. We produce and show renders from multiple views.


Unconditional Generation of Articulated Objects on CelebV-Text:

Comparison of Direct Latent Sampling, our baseline (left), versus our approach (right). We use a real video to drive the articulated motion of the generated faces. No camera information is provided to the network; it is inferred during training.


Text-Driven Generation of Articulated Objects on CelebV-Text:

We visualize results for novel views at -10, 0, and 10 degrees in the left, middle, and right parts, respectively. We use a real video to drive the articulated motion of the generated faces. No camera information is provided to the network; it is inferred during training. We use 256 diffusion steps and classifier-free guidance with a weight of 3.
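As an illustration of the camera setup for these novel views, the sketch below places a camera at -10, 0, and 10 degrees of yaw around the generated volume; the orbit radius and the render_volume helper are hypothetical placeholders, not part of our pipeline.

    import math
    import torch

    def yaw_camera_position(angle_deg, radius=2.0):
        # Place the camera on a horizontal circle around the object at the given yaw angle,
        # looking toward the origin where the decoded volume is centered.
        a = math.radians(angle_deg)
        return torch.tensor([radius * math.sin(a), 0.0, radius * math.cos(a)])

    # Hypothetical usage: render the same generated volume at the three yaw angles shown.
    # for angle_deg in (-10.0, 0.0, 10.0):
    #     camera_position = yaw_camera_position(angle_deg)
    #     image = render_volume(rgbd_volume, camera_position)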