<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="iLRM: An Iterative Large 3D Reconstruction Model.">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>iLRM</title>

  <script>
    window.dataLayer = window.dataLayer || [];

    function gtag() {
      dataLayer.push(arguments);
    }

    gtag('js', new Date());

    gtag('config', 'G-PYVRSFMDRL');
  </script>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <link rel="icon" href="./static/images/favicon.svg">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>


  <div class="header-wrapper">
    <div class="header-container" id="header-container">
      <div class="header-content">
        <h1 style="font-family: 'CookieRun', serif; font-size: 5rem; font-weight: bold; margin-bottom: 0.1rem;">
           <span style="color: rgb(230, 100, 80);">i</span><span style="color: rgb(230, 183, 53);">L</span><span style="color: rgb(117, 160, 85);">R</span><span style="color: rgb(96, 120, 172);">M</span>
        </h1>
        <h1 class="title is-1 publication-title" style="font-weight: bold; margin-bottom: 0.5rem; color: rgb(35, 33, 30);">An Iterative Large 3D Reconstruction Model</h1>        
        <div class="is-size-5 publication-authors" style="margin-bottom: 1rem; color: black">
          Anonymous Authors
        </div>
      </div> 
      <div class="header-video">
          <video id="shiba" autoplay muted loop playsinline height="100%" draggable="false">
            <source src="./static/video/tiger.mp4" type="video/mp4">
          </video>      
      </div>
    </div>
  </div>

<section class="hero teaser">
  <div class="container is-max-desktop" style="margin-top: 2rem;">
    <div class="hero-body">
      <h1 class="subtitle has-text-centered" style="margin-top: -25px; font-size: 2rem">
        <span style="text-align: center; font-size: 70%"><b>Large-scale, high-resolution 3D Gaussian scene reconstruction in just <span style="color: red">0.5 seconds</span></b>.</span>
      </h1>      
      <h1 class="subtitle has-text-centered" style="margin-top: 25px; font-size: 2rem; margin-bottom: -2px;">
        <b>iLRM Overview</b> <br>      
      </h1>
      <img src="static/images/teaser.webp" alt="iLRM overview" class="center" style="margin-bottom: -20px;">
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">  
    <!-- Abstract. -->
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
            Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction.
            In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, 
            has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications.
            However, many state-of-the-art methods, primarily based on transformer architectures, 
            suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, 
            resulting in prohibitive computational costs as the number of views or image resolution increases.
          </p>
          <p>
            Toward scalable and efficient feed-forward 3D reconstruction, 
            we introduce an iterative Large 3D Reconstruction Model (<b><i>iLRM</i></b>) 
            that generates 3D Gaussian representations through an iterative refinement mechanism, 
            guided by three core principles: 
            (1) decoupling the scene representation from input-view images to enable <b><i>compact 3D representations</i></b>; 
            (2) decomposing fully-attentional multi-view interactions into a <b><i>two-stage attention</i></b> scheme to reduce computational costs; 
            and (3) injecting <b><i>high-resolution information at every layer</i></b> to achieve high-fidelity reconstruction.
            Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that <b><i>iLRM</i></b> outperforms existing methods in both reconstruction quality and speed.            
          </p>
        </div>
      </div>
    </div>
  </div>
</section>


<section class="hero is-light is-small">
  <div class="hero-body">
    <div class="container">
      <div id="results-carousel" class="carousel results-carousel" data-slides-to-show="2">
        <div class="item item-steve">
          <video poster="" id="steve" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/flower.mp4"
                    type="video/mp4">
          </video>
        </div>
        <div class="item item-fullbody">
          <video poster="" id="fullbody" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/kit.mp4"
                    type="video/mp4">
          </video>
        </div>        
        <div class="item item-chair-tp">
          <video poster="" id="chair-tp" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/bin.mp4"
                    type="video/mp4">
          </video>
        </div>
        <div class="item item-shiba">
          <video poster="" id="shiba" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/shop.mp4"
                    type="video/mp4">
          </video>
        </div>
        <div class="item item-fullbody">
          <video poster="" id="fullbody" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/build.mp4"
                    type="video/mp4">
          </video>
        </div>    
        <div class="item item-fullbody">
          <video poster="" id="fullbody" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/wedding.mp4"
                    type="video/mp4">
          </video>
        </div>   
        <div class="item item-fullbody">
          <video poster="" id="fullbody" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/buda.mp4"
                    type="video/mp4">
          </video>
        </div>   
        <div class="item item-fullbody">
          <video poster="" id="fullbody" autoplay controls muted loop playsinline height="100%">
            <source src="./static/video/star.mp4"
                    type="video/mp4">
          </video>
        </div>                                
      </div>
      <div class="has-text-centered">
        <p class="subtitle is-5">Zero-shot inference results on the DL3DV dataset using 32 input images at a resolution of 540×960.</p>
      </div>
    </div>
  </div>
</section>



<section class="section">
  <div class="container is-max-desktop">

    <!-- Animation. -->
    <div class="columns is-centered">
      <div class="column is-full-width">
        <h2 class="title is-3">Core architectural design</h2>
        <div class="content has-text-justified">
          <p>
            Our method decouples scene representation from input-view images, 
            enabling efficient computation and compact 3D reconstruction. 
            The example below uses half-resolution views, 
            significantly reducing the attention cost while maintaining high-quality reconstruction.
          </p>
        </div>        
        <img src="static/images/eff_attn.webp" class="center" style="margin-top: -1rem;">
      </div>
    </div>

    <!-- Animation. -->
    <div class="columns is-centered">
      <div class="column is-full-width">
        <h2 class="title is-3" style="margin-top: 0.5rem">Comparison</h2>

        <h3 class="title is-4">RealEstate10K</h3>
        <div class="content has-text-justified">
          <p>
            We compare our method with state-of-the-art methods on the RealEstate10K dataset for varying numbers of input images, using a single NVIDIA RTX 4090 GPU.
          </p>
          <img src="static/images/re10k_table.jpg" class="center"> 
          <img src="static/images/re10k_page.jpg" class="center">   
          <video id="replay-video"
                 controls
                 muted
                 preload
                 autoplay
                 loop
                 playsinline>
            <source src="./static/video/re10k_page_30.mp4"
                    type="video/mp4">
          </video>          
        </div>   

        <h3 class="title is-4" style="margin-top: 1rem">DL3DV low-resolution (256x448)</h3>
        <div class="content has-text-justified">
          <p>
            We compare our method with state-of-the-art methods on the DL3DV dataset using a single NVIDIA RTX 4090 GPU.
            The quantitative results in the table below are evaluated with 50-frame coverage, following the DepthSplat protocol.
            The qualitative comparisons are presented in the accompanying video, 
            with both methods using 24 input images and evaluated over full-frame coverage.
            We also report the encoding time and memory consumption of each method.
            <b>Note that our method generates only 1/4 as many Gaussians as the baseline method.</b>
          </p>          
          <img src="static/images/dl3dv_lr_table.jpg" class="center">           
        </div>
        <div class="content has-text-centered">
          <video id="replay-video"
                 controls
                 muted
                 preload
                 playsinline
                 width="96%">
            <source src="./static/video/lr24.mp4" type="video/mp4">
          </video>
        </div>   

        <!-- Re-rendering. -->
        <h3 class="title is-4">DL3DV high-resolution (512x960)</h3>
        <div class="content has-text-justified">
          <p>
            We compare our method with state-of-the-art methods on the DL3DV dataset using a single NVIDIA RTX 4090 GPU.
            In the qualitative comparisons, both methods use 12 input images at a resolution of 512x960, covering a 100-frame interval.
            Because DepthSplat runs out of memory on this GPU, we evaluate its performance on a single H100 GPU.
            <b>Note that our method generates only 1/4 as many Gaussians as the baseline method.</b>
          </p>
          <img src="static/images/dl3dv_hr_table.jpg" class="center"> 
        </div>
        <div class="content has-text-centered">
          <div class="video-slider-container">
            <div class="video-slider-wrapper">
              <video id="video-left" autoplay muted loop playsinline>
                <source src="static/video/ours_slider.mp4" type="video/mp4">
              </video>
              <video id="video-right" autoplay muted loop playsinline>
                <source src="static/video/ds_slider.mp4" type="video/mp4">
              </video>
              <div id="slider-bar"></div>
              <div id="slider-handle"></div>    
              <div class="video-label video-label-left">iLRM (Ours)</div>
              <div class="video-label video-label-right">DepthSplat</div>    
              <!-- 🟡 Play/Pause Button -->
              <button id="toggle-play" class="video-toggle-button">
              <img id="toggle-icon" src="static/images/pause.png" alt="Pause" />
              </button>
            </div>    
          </div>          
        </div>        

        <h3 class="title is-4" style="margin-top: 1rem">DL3DV high-resolution, wide-baseline (540x960, Undistorted)</h3>
        <div class="content has-text-justified">
          <p>
            We compare our method against the current state-of-the-art wide-coverage feed-forward 3D reconstruction model, LongLRM, 
            as well as the optimization-based methods 3D-GS and Mip-Splatting. 
            <em>Undistorted</em> refers to the undistorted version of the DL3DV dataset.
            The quantitative results are presented in the table below, evaluated with 32 input images under full-frame coverage following the LongLRM protocol.
            Both 3D-GS and Mip-Splatting are trained for 30k iterations.
            LongLRM<span style="vertical-align: sub; font-size: smaller;">10</span> denotes fine-tuning for 10 epochs, initialized from the Gaussians generated by LongLRM. 
            Because we generate more compact 3D Gaussian representations, our fine-tuning converges much faster than that of LongLRM. We use FlashAttention-3 for zero-shot inference.
          </p>         
          <img src="static/images/wide_table.jpg" class="center">  
          <img src="static/images/graph.png" class="center">  
        </div>
      </div>
    </div>
  </div>
</section>


<script>
  window.addEventListener("DOMContentLoaded", () => {
    const sliderBar = document.getElementById('slider-bar');
    const videoRight = document.getElementById('video-right');
    const videoLeft = document.getElementById('video-left');
    const wrapper = document.querySelector('.video-slider-wrapper');
    const sliderHandle = document.getElementById('slider-handle');
    
    // Reveal the right-hand video from the cursor position to the right edge.
    wrapper.addEventListener('mousemove', (e) => {
      const bounds = wrapper.getBoundingClientRect();
      const offsetX = e.clientX - bounds.left;
      const percent = offsetX / bounds.width;
      // Clamp to [0, 100] so the divider never leaves the wrapper.
      const clipPercent = Math.min(100, Math.max(0, percent * 100));
      videoRight.style.clipPath = `inset(0 0 0 ${clipPercent}%)`;
      sliderBar.style.left = `${clipPercent}%`;
      sliderHandle.style.left = `${clipPercent}%`;
    });

    // Reset the divider to the center when the cursor leaves the wrapper.
    wrapper.addEventListener('mouseleave', () => {
      videoRight.style.clipPath = `inset(0 0 0 50%)`;
      sliderBar.style.left = `50%`;
      sliderHandle.style.left = `50%`;
    });

    // ▶️ Pause/Play toggle
    const toggleBtn = document.getElementById("toggle-play");
    const icon = document.getElementById("toggle-icon");

    toggleBtn.addEventListener("click", () => {
      // Check the actual playback state; autoplay may have been blocked by the browser.
      if (!videoLeft.paused) {
        videoLeft.pause();
        videoRight.pause();
        icon.src = "static/images/play.png";
        icon.alt = "Play";
      } else {
        videoLeft.play();
        videoRight.play();
        icon.src = "static/images/pause.png";
        icon.alt = "Pause";
      }
    });
  });
</script>


</body>
</html>
