<!doctype html>
<html lang="en">
<head>
	<!-- Required meta tags -->
	<meta charset="utf-8">
	<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

	<!-- Bootstrap CSS -->
	<link href="resources/bootstrap.min.css" rel="stylesheet">
        <link rel="stylesheet" href="./resources/bulma.min.css">
        <link rel="stylesheet" href="./resources/bulma-carousel.min.css">
	  <style>
	
	    .custom-carousel-wrapper {
	      display: flex;
	      justify-content: center;
	    }
	    .custom-carousel-cond {
	      width: 1024px;
	      height: 640px;
	    }

	    .custom-carousel-unceleb {
	      width: 630px;
	      height: auto;
	    }
	
	    .custom-carousel-uncond {
	      width: 768px;
	      height: 640px;
	    }
	  </style>
        <link rel="stylesheet" href="./resources/bulma-slider.min.css">
        <link rel="stylesheet" href="./resources/fontawesome.all.min.css">
        <link rel="stylesheet" href="./resources/index.css">
 
        <link rel="stylesheet" href="./resources/academicons.min.css">

        <script src="./resources/jquery.min.js"></script>
        <script defer src="./resources/fontawesome.all.min.js"></script>
        <script src="./resources/bulma-carousel.min.js"></script>
        <script src="./resources/bulma-slider.min.js"></script>
        <script src="./resources/index.js"></script>

	<title>AutoDecoding Latent 3D Diffusion Models</title>
</head>
<body>

	<section class="jumbotron text-center">
            <h1 class="publication-title">AutoDecoding Latent 3D Diffusion Models</h1>
            <h3 class="publication-title">Supplementary Material</h3>
            <br/>
	</section>

        <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Unconditional Generation on Objaverse:</h2>
		      <p class="lead">We train an unconditional 3D diffusion model on the latent features of a 3D AutoDecoder trained on Objaverse. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. We show renders from multiple views.</p>
	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-uncond" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_uncond/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_uncond/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_uncond/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_uncond/output_3.mp4" type="video/mp4">
						</video>			
				       </div>
	
            </div>
            </div>
             </div>
        </div>


        <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Direct Latent Sampling Generation on Objaverse:</h2>
		      <p class="lead">We sample a random vector in the latent space of a 3D AutoDecoder trained on Objaverse, as proposed by Unsupervised Volumetric Animation. We then decode it into a 64x64x64 RGB-D voxel grid. We show renders from multiple views.</p>
	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-uncond" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_direct/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_direct/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_direct/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_direct/output_3.mp4" type="video/mp4">
						</video>			
				       </div>

				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_direct/output_4.mp4" type="video/mp4">
						</video>			
				       </div>
	
           </div>
           </div>
           </div>
        </div>

 
       
        <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Text-Driven Generation on Objaverse:</h2>
		      <p class="lead">We train a text-conditioned 3D diffusion model on the latent features of a 3D AutoDecoder trained on Objaverse. Captions were extracted using MiniGPT4. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. During diffusion, we apply classifier-free guidance with weight 3. We show renders from multiple views.</p>
	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-cond" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_cond/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_cond/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_cond/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_cond/output_3.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
	
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/obj_cond/output_4.mp4" type="video/mp4">
						</video>			
			       </div>
	
            </div>
            </div>
             </div>
        </div>



       
        <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Unconditional Generation on MVImgNet:</h2>
		      <p class="lead">We train an unconditional 3D diffusion model on the latent features of a 3D AutoDecoder trained on MVImgNet. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. We show renders from multiple views.</p>
	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-uncond" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_uncond/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_uncond/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_uncond/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_uncond/output_3.mp4" type="video/mp4">
						</video>			
				       </div>
	
            </div>
            </div>
             </div>
        </div>



        <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Direct Latent Sampling Generation on MVImgNet:</h2>
		      <p class="lead">We sample a random vector in the latent space of a 3D AutoDecoder trained on MVImgNet, as proposed by Unsupervised Volumetric Animation. We then decode it into a 64x64x64 RGB-D voxel grid. We show renders from multiple views.</p>
	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-uncond" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_direct/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_direct/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_direct/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_direct/output_3.mp4" type="video/mp4">
						</video>			
				       </div>

				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_direct/output_4.mp4" type="video/mp4">
						</video>			
				       </div>
	
           </div>
           </div>
           </div>
        </div>
 


         <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Text-Driven Generation on MVImgNet:</h2>
		      <p class="lead">We train a text-conditioned 3D diffusion model on the latent features of a 3D AutoDecoder trained on MVImgNet. Captions were extracted using MiniGPT4. After 256 diffusion steps, we upsample the generated latent volume to a 64x64x64 RGB-D grid. During diffusion, we apply classifier-free guidance with weight 3. We show renders from multiple views.</p>
	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-cond" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_cond/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_cond/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_cond/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_cond/output_3.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
	
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/mvin_cond/output_4.mp4" type="video/mp4">
						</video>			
			       </div>
	
            </div>
            </div>
             </div>
        </div>

      
        <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Unconditional Generation of Articulated Objects on CelebV-Text:</h2>
		      <p class="lead">We compare Direct Latent Sampling, our baseline (left), with our approach (right). We use a real video to drive the articulated motion of the generated faces. No camera information is provided to the network; it is inferred during training.</p>
	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-unceleb" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_uncond/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_uncond/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_uncond/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
			       </div>
	
            </div>
            </div>
             </div>


        <div class="container pt-5">
           <div class="content">
	    <div>
                      <h2 class="title is-3">Text-Driven Generation of Articulated Objects on CelebV-Text:</h2>
		      <p class="lead">We visualize novel views at -10, 0, and 10 degrees in the left, middle, and right panels, respectively. We use a real video to drive the articulated motion of the generated faces. No camera information is provided to the network; it is inferred during training. We use 256 diffusion steps and classifier-free guidance with weight 3.</p>

	    </div>
            <br/>
            <div class="custom-carousel-wrapper">
            <div class="carousel custom-carousel-cond" style="overflow: hidden">
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_cond/output_0.mp4" type="video/mp4">
						</video>			
			         </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_cond/output_1.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_cond/output_2.mp4" type="video/mp4">
						</video>			
				       </div>
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_cond/output_3.mp4" type="video/mp4">
						</video>			
				       </div>
	
				 <div class="item">
						<video poster="" autoplay muted loop playsinline controls>
						      <source src="./media/celeb_cond/output_4.mp4" type="video/mp4">
						</video>			
				       </div>
	
			       </div>
	
            </div>
            </div>
             </div>





</body>
</html>
