<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <meta
      name="description"
      content="Ouroboros3D, a unified 3D generation framework, which integrates diffusion-based multi-view image generation and 3D reconstruction into a recursive diffusion process." />
    <meta
      name="keywords"
      content="Ouroboros3D, Image-to-3D, Image-to-Multiview" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>
      Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
    </title>

    <link
      href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
      rel="stylesheet" />

    <link rel="stylesheet" href="./static/css/bulma.min.css" />
    <link rel="stylesheet" href="./static/css/bulma-carousel.min.css" />
    <link rel="stylesheet" href="./static/css/bulma-slider.min.css" />
    <link rel="stylesheet" href="./static/css/fontawesome.all.min.css" />
    <link
      rel="stylesheet"
      href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css" />
    <link rel="stylesheet" href="./static/css/index.css" />
    <link rel="icon" href="./static/images/favicon.svg" />

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
    <script defer src="./static/js/fontawesome.all.min.js"></script>
    <script src="./static/js/bulma-carousel.min.js"></script>
    <script src="./static/js/bulma-slider.min.js"></script>
    <script src="./static/js/index.js"></script>
  </head>
  <body>
    <style>
      .container {
        max-width: 1100px;
        margin: 0 auto;
      }
    </style>

    <nav class="navbar" role="navigation" aria-label="main navigation">
      <div class="navbar-brand">
        <a
          role="button"
          class="navbar-burger"
          aria-label="menu"
          aria-expanded="false">
          <span aria-hidden="true"></span>
          <span aria-hidden="true"></span>
          <span aria-hidden="true"></span>
        </a>
      </div>
    </nav>

    <section class="hero">
      <div class="hero-body">
        <div class="container is-max-desktop">
          <div class="columns is-centered">
            <div class="column has-text-centered">
              <h1 class="title is-1 publication-title">
                Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive
                Diffusion
              </h1>
              <!-- <h2 class="title is-3 publication-title" style="color:#008bf5;">CVPR 2024</h2> -->
            </div>
          </div>
        </div>
      </div>
    </section>

    <section class="" style="padding-bottom: 3rem">
      <div class="container is-max-desktop">
        <!-- Abstract. -->
        <div class="columns is-centered has-text-centered">
          <div class="column is-four-fifths">
            <h2 class="title is-3">Abstract</h2>
            <div class="content has-text-justified">
              <p>
                Existing image-to-3D creation methods typically split the task into multi-view image generation and 3D reconstruction, leading to two main limitations: (1) multi-view bias, where geometric inconsistencies arise because multi-view diffusion models ensure image-level rather than 3D consistency; and (2) misaligned reconstruction data, since reconstruction models trained mostly on synthetic data are mismatched to the generated multi-view images they process during inference. To address these issues, we propose Ouroboros3D, a unified framework that integrates multi-view generation and 3D reconstruction into a recursive diffusion process.
                By incorporating a 3D-aware feedback mechanism, our multi-view diffusion model leverages the explicit 3D information from the reconstruction results of the previous denoising process as conditions, thus modeling consistency at the 3D geometric level. Furthermore, through joint training of both the multi-view diffusion and reconstruction models, we alleviate reconstruction bias due to data misalignment and enable mutual enhancement within the multi-step recursive process. Experimental results demonstrate that Ouroboros3D outperforms methods that treat these stages separately and those that combine them only during inference, achieving superior multi-view consistency and producing 3D models with higher geometric realism.
              </p>
            </div>
          </div>
        </div>
        <!--/ Abstract. -->
      </div>
    </section>

    <section class="">
      <div class="container is-max-desktop">
        <div>
          <div class="is-centered has-text-centered">
            <h2 class="title is-3">Image-to-3D</h2>
          </div>
          <div class="hero-body" style="padding-bottom: 3rem">
            <video
              width="100%"
              playsinline=""
              autoplay="autoplay"
              loop="loop"
              preload=""
              muted="">
              <source src="./static/video/video_ood.mp4" type="video/mp4" />
            </video>
          </div>
        </div>
      </div>
    </section>

    <section class="">
      <div class="container is-max-desktop">
        <div>
          <div class="is-centered has-text-centered">
            <h2 class="title is-3">3D-aware Recursive Diffusion</h2>
          </div>
          <div
            class="hero-body"
            style="
              padding-bottom: 0;
              padding-left: 100px;
              padding-right: 100px;
            ">
            <img
              src="./static/images/pipeline-comparison.png"
              alt="concept comparison"
              height="100%" />
          </div>
          <div class="columns is-centered has-text-centered">
            <div class="column is-four-fifths">
              <div class="content has-text-justified">
                <p>
                  <b>Concept comparison</b> between Ouroboros3D and previous
                  two-stage methods. Instead of directly combining a multi-view
                  diffusion model and a reconstruction model, our
                  self-conditioned framework jointly trains the two models and
                  links them in a recursive loop. At each step of the denoising
                  process, the rendered 3D-aware maps are fed to the multi-view
                  generation model at the next step.
                </p>
              </div>
            </div>
          </div>

          <div
            style="
              padding-left: 100px;
              padding-right: 100px;
              padding-bottom: 0;
            ">
            <img
              src="./static/images/recursive-diffusion.png"
              alt="recursive diffusion"
              height="100%" />
          </div>
          <div class="columns is-centered has-text-centered">
            <div class="column is-four-fifths">
              <div class="content has-text-justified">
                <p>
                  <b>Concept of 3D-aware recursive diffusion.</b> During
                  multi-view denoising, the diffusion model uses 3D-aware maps
                  rendered by the reconstruction module at the previous step as
                  conditions.
                </p>
              </div>
            </div>
          </div>
          <div style="padding-bottom: 3rem">
            <video
              width="100%"
              playsinline=""
              autoplay="autoplay"
              loop="loop"
              preload=""
              muted="">
              <source
                src="./static/video/video-recursive.mp4"
                type="video/mp4" />
            </video>
          </div>
        </div>
      </div>
    </section>

    <section class="">
      <div class="container is-max-desktop">
        <!-- Method. -->
        <div>
          <div class="is-centered has-text-centered">
            <h2 class="title is-3">Method Overview</h2>
          </div>
          <div class="hero-body" style="padding-bottom: 0; padding-left: 100px">
            <img
              src="./static/images/pipeline.png"
              alt="Method Overview"
              height="100%" />
          </div>
          <div
            class="columns is-centered has-text-centered"
            style="padding-bottom: 3rem">
            <div class="column is-four-fifths">
              <div class="content has-text-justified">
                <p>
                  Overview of <span class="dnerf">Ouroboros3D</span>. In the
                  denoising sampling loop, we decode the predicted x0 to
                  noise-corrupted images, which a feed-forward reconstruction
                  model then uses to recover a 3D representation. The rendered
                  color images and coordinate maps are then encoded and fed
                  into the next denoising step.
                </p>
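                <p>
                  The loop described above can be sketched in a few lines of
                  Python. This is a minimal toy illustration, assuming NumPy
                  and not the released implementation: the names
                  <code>predict_x0</code>, <code>reconstruct_and_render</code>,
                  and <code>recursive_sampling</code> are hypothetical, the
                  denoiser and the reconstruction model are replaced by simple
                  arithmetic stand-ins, and the noise schedule is an arbitrary
                  linear one. It only shows the control flow, in which the maps
                  rendered at one step become the condition for the next.
                </p>
<pre><code class="language-python">import numpy as np


def predict_x0(x_t, cond):
    """Toy stand-in for the multi-view denoiser (assumption, not the real model).

    Blends the noisy views with the 3D-aware condition when one is available.
    """
    if cond is None:
        return 0.5 * x_t
    return 0.5 * x_t + 0.5 * cond


def reconstruct_and_render(views):
    """Toy stand-in for feed-forward reconstruction followed by rendering.

    Averaging over the view axis plays the role of fusing all views into one
    shared 3D representation; repeating the result plays the role of rendering
    3D-aware maps back into every view.
    """
    fused = views.mean(axis=0, keepdims=True)
    return np.repeat(fused, views.shape[0], axis=0)


def recursive_sampling(num_views=4, size=8, num_steps=5, seed=0):
    """Run the toy recursive loop and return the final views and condition."""
    rng = np.random.default_rng(seed)
    x_t = rng.standard_normal((num_views, size, size))
    cond = None  # no 3D feedback exists before the first step
    for step in range(num_steps):
        x0 = predict_x0(x_t, cond)
        # Maps rendered here are the condition for the NEXT denoising step.
        cond = reconstruct_and_render(x0)
        # Arbitrary linear schedule: the noise level reaches 0 at the last step.
        noise_scale = 1.0 - (step + 1) / num_steps
        x_t = x0 + noise_scale * rng.standard_normal(x0.shape)
    return x_t, cond


views, maps = recursive_sampling()
print(views.shape, maps.shape)
</code></pre>
                <p>
                  In this toy version the condition is identical across views
                  by construction, which mirrors the idea that the feedback
                  comes from a single shared 3D representation rather than
                  from per-view image features.
                </p>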
              </div>
            </div>
          </div>
        </div>
        <!--/ Method. -->
      </div>
    </section>

    <section class="">
      <div class="container is-max-desktop">
        <div>
          <div class="is-centered has-text-centered">
            <h2 class="title is-3">Results on GSO Dataset</h2>
          </div>
          <div class="hero-body">
            <video
              width="100%"
              playsinline=""
              autoplay="autoplay"
              loop="loop"
              preload=""
              muted="">
              <source src="./static/video/gso_video.mp4" type="video/mp4" />
            </video>
          </div>
        </div>
      </div>

      <div class="container is-max-desktop">
        <div>
          <div class="is-centered has-text-centered">
            <h2 class="title is-3">More Results</h2>
          </div>
          <div class="hero-body">
            <video
              width="100%"
              playsinline=""
              autoplay="autoplay"
              loop="loop"
              preload=""
              muted="">
              <source src="./static/video/video-genshin.mp4" type="video/mp4" />
            </video>
          </div>
        </div>
      </div>
    </section>
  </body>
</html>
