<!DOCTYPE html>
<html>

<head lang="en">
    <meta charset="UTF-8">
    <meta http-equiv="x-ua-compatible" content="ie=edge">

    <title>DressRecon: Freeform 4D Human Reconstruction from Monocular Video</title>

    <meta name="description" content="">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.css">
    <link rel="stylesheet" href="css/app.css">

    <link rel="stylesheet" href="css/bootstrap.min.css">

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/1.5.3/clipboard.min.js"></script>
    
    <script src="js/app.js"></script>
</head>

<body>
    <div class="container" id="main" style="max-width: 800px">
        <div class="row">
            <h2 class="col-md-12 text-center">
                DressRecon: Freeform 4D Human Reconstruction from Monocular Video<br>
                <small>3DV 2025 (Oral)</small>
            </h2>
            <center>
                <a href="https://jefftan969.github.io/">Jeff Tan</a>,
                <a href="https://xiangdonglai.github.io/">Donglai Xiang</a>,
                <a href="https://shubhtuls.github.io/">Shubham Tulsiani</a>,
                <a href="https://cs.cmu.edu/~deva/">Deva Ramanan</a>,
                <a href="https://gengshan-y.github.io/">Gengshan Yang</a>
                <br>
                Carnegie Mellon University
            </center>
        </div>

  
        <div class="row">
            <div class="col-md-6 col-md-offset-3 text-center">
                <ul class="nav nav-pills nav-justified">
                    <li>
                        <a href="https://arxiv.org/abs/2409.20563">
                        <image src="img/dressrecon_paper_image.png" height="60px">
                        <h4><strong>arXiv</strong></h4>
                        </a>
                    </li>
                    <!--
                    <li>
                        <a href="https://jefftan969.github.io/dressrecon/poster.pdf">
                        <image src="img/dressrecon_poster_image.png" height="30px" style="margin-top:30px">
                        <h4><strong>Poster</strong></h4>
                        </a>
                    </li>
                    <li>
                        <a href="https://www.youtube.com/watch?v=taUtXtW8b3Q">
                        <image src="img/youtube_icon.png" height="60px">
                        <h4><strong>Video</strong></h4>
                        </a>
                    </li>
                    -->
                    <li>
                        <a href="https://github.com/jefftan969/dressrecon">
                        <image src="img/github_icon.png" height="60px">
                        <h4><strong>Code</strong></h4>
                        </a>
                    </li>
                </ul>
            </div>
        </div>


        <div class="row">
            <div class="col-md-12">
                <center>
                    <video id="teaser_video" width="100%" autoplay loop muted>
                        <source src="./img/teaser.mp4" type="video/mp4" />
                    </video>
                </center>

                <h3>Abstract</h3>
                <p class="text-justify">
                We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic human priors about articulated body shape (learned from large-scale training data) with video-specific articulated ``bag-of-bones" deformation (fit to a single video via test-time optimization). We accomplish this by learning a neural implicit model that disentangles body versus clothing deformations as separate motion model layers. To capture subtle geometry of clothing, we leverage image-based priors such as human body pose, surface normals, and optical flow during optimization. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering. On datasets with highly challenging clothing deformations and object interactions, DressRecon yields higher-fidelity 3D reconstructions than prior art.
                </p>

                <h3>Method</h3>
                The body shape is a neural signed distance field in the canonical space. During volume rendering, rays at time t are traced back to the canonical space via a deformation field.

                <h4>Hierarchical Deformation</h4>
                Hierarchical motion fields, represented by body and clothing Gaussians, warp between the canonical shape and time t. The motion fields capture limb motions as well as fine-grained clothing deformations. Using a two-layer model allows us to initialize body pose from off-the-shelf estimates. Below, the woman's arms stop moving when body deformation is removed (middle).
                <br>
                <center>
                    <video id="hierarchical_deformation" width="100%" autoplay loop muted>
                        <source src="./img/hierarchical_deformation.mp4" type="video/mp4" />
                    </video>
                </center>

                <h4>Image-Based Priors</h4>
                To capture subtle geometry and make optimization tractable, we use foundational image-based priors as supervision, including surface normals, optical flow, universal features, segmentation masks, and 3D human body pose. Each observation below contributes an additional loss term.
                <br>
                <center>
                    <video id="image_based_priors" width="100%" autoplay loop muted>
                        <source src="./img/image_based_priors.mp4" type="video/mp4" />
                    </video>
                </center>

                <h4>Refinement with 3D Gaussians</h4>
                The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians to improve the rendering quality and enable interactive visualization. Below, refinement from an implicit SDF to 3D Gaussians improves texture quality.
                <br>
                <center>
                    <video id="gaussian_refinement" width="100%" autoplay loop muted>
                        <source src="./img/gaussian_refinement.mp4" type="video/mp4" >
                    </video>
                </center>

                <h3>Results</h3>
                DressRecon reconstructs high-fidelity shapes and motions even in challenging scenarios. Below, we show the reconstructed shape, rendered 3D point tracks, 3D Gaussian locations after refinement, input-view RGB renderings, and input monocular videos on DNA-Rendering sequences.
                <br>

                <button id="results_1_shape">Shape</button>
                <button id="results_1_tracks">3D Tracks</button>
                <button id="results_1_gaussians">3D Gaussians</button>
                <button id="results_1_rgb">Rendered RGB</button>
                <button id="results_1_ref_rgb">Input Monocular Video</button>
                <br>
                <button id="results_1_group0">⠀1⠀</button>
                <button id="results_1_group1">⠀2⠀</button>
                <button id="results_1_group2">⠀3⠀</button>
                <button id="results_1_group3">⠀4⠀</button>
                <button id="results_1_group4">⠀5⠀</button>
                <center>
                    <video id="results_1" width="80%" autoplay loop muted>
                        <source src="" type="video/mp4" >
                    </video>
                </center>

                <button id="results_2_shape">Shape</button>
                <button id="results_2_tracks">3D Tracks</button>
                <button id="results_2_gaussians">3D Gaussians</button>
                <button id="results_2_rgb">Rendered RGB</button>
                <button id="results_2_ref_rgb">Input Monocular Video</button>
                <br>
                <button id="results_2_group0">⠀1⠀</button>
                <button id="results_2_group1">⠀2⠀</button>
                <button id="results_2_group2">⠀3⠀</button>
                <button id="results_2_group3">⠀4⠀</button>
                <button id="results_2_group4">⠀5⠀</button>
                <center>
                    <video id="results_2" width="80%" autoplay loop muted>
                        <source src="" type="video/mp4" >
                    </video>
                </center>

                <h3>Extreme View Synthesis</h3>
                The reconstructed avatars can be rendered from any view. Given the input monocular video on the left, we show four novel-view renderings at extreme views.
                <br>
                <button id="extreme_view_synthesis_0008_01">0008_01</button>
                <button id="extreme_view_synthesis_0047_12">0047_12</button>
                <button id="extreme_view_synthesis_0113_06">0113_06</button>
                <button id="extreme_view_synthesis_0121_02">0121_02</button>
                <button id="extreme_view_synthesis_0123_02">0123_02</button>
                <button id="extreme_view_synthesis_0128_04">0128_04</button>
                <button id="extreme_view_synthesis_0152_01">0152_01</button>
                <button id="extreme_view_synthesis_0188_02">0188_02</button>
                <center>
                    <video id="extreme_view_synthesis" width="100%" autoplay loop muted>
                        <source src="" type="video/mp4" >
                    </video>
                </center>

                <h3>Motion Decomposition</h3>
                The body and clothing deformation layers are evenly distributed in space, and are often responsible for separate types of motion. Below, we remove each motion type from the reconstructed avatar. Clothing Gaussians are yellow and body Gaussians are blue.
                <br>
                <button id="motion_decomposition_0008_01">0008_01</button>
                <button id="motion_decomposition_0047_01">0047_01</button>
                <button id="motion_decomposition_0047_12">0047_12</button>
                <button id="motion_decomposition_0102_02">0102_02</button>
                <button id="motion_decomposition_0113_06">0113_06</button>
                <button id="motion_decomposition_0121_02">0121_02</button>
                <button id="motion_decomposition_0123_02">0123_02</button>
                <button id="motion_decomposition_0128_04">0128_04</button>
                <button id="motion_decomposition_0152_01">0152_01</button>
                <button id="motion_decomposition_0188_02">0188_02</button>
                <center>
                    <video id="motion_decomposition" width="100%" autoplay loop muted>
                        <source src="" type="video/mp4" >
                    </video>
                </center>

                <h3>Baseline Comparisons</h3>
                We compare DressRecon's shape with several baselines, on DNA-Rendering sequences that contain challenging clothing deformation and handheld objects. DressRecon is able to reconstruct challenging deformable structures with higher fidelity than prior art.
                <br>
                <button id="baseline_comparison_0008_01">0008_01</button>
                <button id="baseline_comparison_0047_01">0047_01</button>
                <button id="baseline_comparison_0047_12">0047_12</button>
                <button id="baseline_comparison_0102_02">0102_02</button>
                <button id="baseline_comparison_0113_06">0113_06</button>
                <button id="baseline_comparison_0121_02">0121_02</button>
                <button id="baseline_comparison_0123_02">0123_02</button>
                <button id="baseline_comparison_0128_04">0128_04</button>
                <button id="baseline_comparison_0152_01">0152_01</button>
                <button id="baseline_comparison_0188_02">0188_02</button>
                <center>
                    <video id="baseline_comparison" width="100%" autoplay loop muted>
                        <source src="" type="video/mp4" >
                    </video>
                </center>

                <h3>Bibtex</h3>
                <pre>
@article{tan2024dressrecon,
    title={DressRecon: Freeform 4D Human Reconstruction from Monocular Video},
    author={Tan, Jeff and Xiang, Donglai and Tulsiani, Shubham and Ramanan, Deva and Yang, Gengshan},
    journal={arXiv preprint arXiv:2409.20563},
    year={2024}
}</pre>

                <h3>Acknowledgments</h3>
                <p class="text-justify">
                The website template was borrowed from <a href="http://jonbarron.info/">Jon Barron</a>.
                </p>
            </div>
        </div>
    </div>
</body>
</html>
