<!DOCTYPE html>

<head>
    <meta charset="utf-8">
    <title>JOG3R</title>
    <link rel="stylesheet" href="css/style.css">
    <link rel="stylesheet" href="css/slider.css">
    <link href="https://fonts.googleapis.com/css?family=Pacifico" rel="stylesheet">
</head>

<body>
    <div id="body">
        <h1 id="title">JOG3R: Camera Pose Estimation <br> Emerging In Video Diffusion Transformer</h1>
        <h3 id="conference">ICLR 2025 <br> SUBMISSION 
    </h3>
        <p style="max-width:700px; margin:auto; text-align: justify; margin-bottom: 1em">
            This page presents video results and 3D camera reconstruction results. Videos should play automatically and loop. The page was tested in the Chrome browser on a 2K-resolution display. If parts of the page appear cut off, please zoom out in your browser.
        </p>
        <div id="content">
            <details>
                <summary>V2C: JOG3R (ours) vs. ours w/o generation loss vs. DUSt3R </summary>
                <p style="font-size:20px;">
                    <b>Our method generates more accurate correspondences and camera trajectories than DUSt3R. <br> 
                        We also compare against our method without the generation loss.<br> 
                    For each pair of frames, we visualize only 10 correspondences to avoid clutter.</b>
                </p>
                <details>
                    <summary>da80d87326bf63b7 (JOG3R vs. DUSt3R)</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/3_da80d87326bf63b7_slow.gif" height="300">
                                <div>
                                    our correspondences
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/3_da80d87326bf63b7_slow.gif" height="300">
                                <div>
                                    DUSt3R's correspondences <br>(note the drift of the second-to-last line)
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/3_da80d87326bf63b7_trajectory.jpg" height="350">
                                <div>
                                    our camera trajectory
                                </div>
                            </th>
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/3_da80d87326bf63b7_trajectory.jpg" height="350">
                                <div>
                                    DUSt3R's camera trajectory
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>
                <details>
                    <summary>fb52f951d8a8ad11 (JOG3R vs. DUSt3R)</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/6_fb52f951d8a8ad11_slow.gif" height="300">
                                <div>
                                    our correspondences
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/6_fb52f951d8a8ad11_slow.gif" height="300">
                                <div>
                                    DUSt3R's correspondences 
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/6_fb52f951d8a8ad11_trajectory.jpg" height="350">
                                <div>
                                    our camera trajectory
                                </div>
                            </th>
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/6_fb52f951d8a8ad11_trajectory.jpg" height="350">
                                <div>
                                    DUSt3R's camera trajectory
                                </div>
                            </th>
                        </tr>
                      </table>
                </details>
                <details>
                    <summary>26fe74c70177d694 (JOG3R vs. DUSt3R)</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/0_26fe74c70177d694_slow.gif" height="300">
                                <div>
                                    our correspondences
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/0_26fe74c70177d694_slow.gif" height="300">
                                <div>
                                    DUSt3R's correspondences
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/0_26fe74c70177d694_trajectory.jpg" height="350">
                                <div>
                                    our camera trajectory <br> (camera moves only rightwards)
                                </div>
                            </th>
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/0_26fe74c70177d694_trajectory.jpg" height="350">
                                <div>
                                    DUSt3R's camera trajectory <br> (camera jitters around)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>
                <details>
                    <summary>e0577a912fd116ea (JOG3R vs. DUSt3R)</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/1_e0577a912fd116ea_slow.gif" height="300">
                                <div>
                                    our correspondences
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/1_e0577a912fd116ea_slow.gif" height="300">
                                <div>
                                    DUSt3R's correspondences
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/1_e0577a912fd116ea_trajectory.jpg" height="350">
                                <div>
                                    our camera trajectory
                                </div>
                            </th>
                            <th>
                                <img src="assets/v2c/dust3r_pretrained/1_e0577a912fd116ea_trajectory.jpg" height="350">
                                <div>
                                    DUSt3R's camera trajectory <br> (camera doesn't move horizontally)
                                </div>
                            </th>
                        </tr>
                      </table>
                </details>
                <details>
                    <summary>1de1b73fe4d6aa77 (JOG3R vs. JOG3R w/o generation loss)</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/200_1de1b73fe4d6aa77_slow.gif" height="300">
                                <div>
                                    ours
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/v2c/ours_nogenloss/200_1de1b73fe4d6aa77_slow.gif" height="300">
                                <div>
                                    ours w/o gen loss
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/200_1de1b73fe4d6aa77_trajectory.jpg" height="350">
                                <div>
                                    our camera trajectory
                                </div>
                            </th>
                            <th>
                                <img src="assets/v2c/ours_nogenloss/200_1de1b73fe4d6aa77_trajectory.jpg" height="350">
                                <div>
                                    ours w/o gen loss <br>
                                    (camera moves in a straight line with no curved trajectory; <br> 
                                    the green camera has a sudden jump)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>
                <details>
                    <summary>d48b66d36ec83707 (JOG3R vs. JOG3R w/o generation loss)</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/1500_d48b66d36ec83707_slow.gif" height="300">
                                <div>
                                    ours
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/v2c/ours_nogenloss/1500_d48b66d36ec83707_slow.gif" height="300">
                                <div>
                                    ours w/o gen loss
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/v2c/ours/1500_d48b66d36ec83707_trajectory.jpg" height="350">
                                <div>
                                    our camera trajectory <br>
                                    (camera moves only forward)
                                </div>
                            </th>
                            <th>
                                <img src="assets/v2c/ours_nogenloss/1500_d48b66d36ec83707_trajectory.jpg" height="350">
                                <div>
                                    ours w/o gen loss <br>
                                    (camera jitters back and forth)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>
            </details>
            <details> 
                <summary>T2V: JOG3R (ours) vs. ours w/o reconstruction loss</summary>
                <p style="font-size:20px;">
                    <b>We compare against a variant trained w/o the reconstruction loss and show that the reconstruction loss helps generation.</b>
                </p>
                <details>
                    <summary>an empty basement with wood paneling on the walls</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/ours/sample_7.mp4">
                                </video>
                                <div>
                                    ours<br> 
                                </div>
                            </th>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/sora_finetuned/sample_7.mp4">
                                </video>
                                <div>
                                    ours w/o reconstruction loss <br> (quality degradation)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>  
                <details>
                    <summary>an outdoor swimming pool surrounded by rocks and lounge chairs</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/ours/sample_170.mp4">
                                </video>
                                <div>
                                    ours<br> 
                                </div>
                            </th>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/sora_finetuned/sample_170.mp4">
                                </video>
                                <div>
                                    ours w/o reconstruction loss <br> (noticeable artifacts, no camera motion)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>  
                <details>
                    <summary>a dining room table with chairs and a vase of flowers</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/ours/sample_55.mp4">
                                </video>
                                <div>
                                    ours<br> 
                                </div>
                            </th>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/sora_finetuned/sample_55.mp4">
                                </video>
                                <div>
                                    ours w/o reconstruction loss <br> (left chair has artifacts)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>  
                <details>
                    <summary>a living room with a couch, coffee table, and entertainment center</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/ours/sample_71.mp4">
                                </video>
                                <div>
                                    ours <br> 
                                </div>
                            </th>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/sora_finetuned/sample_71.mp4">
                                </video>
                                <div>
                                    ours w/o reconstruction loss <br> (deformation artifacts appear on the left at the end)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>  
                <details>
                    <summary>a laundry room with a washer and dryer in it</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/ours/sample_9.mp4">
                                </video>
                                <div>
                                    ours<br> 
                                </div>
                            </th>
                            <th>
                                <video autoplay muted loop height="500">
                                    <source src="assets/t2v/sora_finetuned/sample_9.mp4">
                                </video>
                                <div>
                                    ours w/o reconstruction loss <br> (implausible washing machine configuration)
                                </div>
                            </th>
                        </tr>
                    </table>
                </details>  
            </details>
            <details> 
                <summary>T2V+C</summary>
                <p style="font-size:20px;">
                    <b>All videos in this section are generated by JOG3R. <br> 
                        Our T2V+C pipeline reconstructs 3D cameras consistent with those from T2V->V2C.<br> 
                    For each pair of frames, we visualize only 10 correspondences to avoid clutter.</b>
                </p>
                <!-- <img src="assets/imgs/table_quantitative.png"> -->
                <details>
                    <summary>a living room with leather chairs and guitars</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/t2vc/00_slow.gif" height="300">
                                <div>
                                    correspondences from T2V+C 
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/t2vv2c/0_a_living_room_with_leather_chairs_and_guitars_slow.gif" height="300">
                                <div>
                                    correspondences from T2V->V2C
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/t2vc/00_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V+C
                                </div>
                            </th>
                            <th>
                                <img src="assets/t2vv2c/0_a_living_room_with_leather_chairs_and_guitars_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V->V2C
                                </div>
                            </th>
                        </tr>
                      </table>
                </details>
                <details>
                    <summary>a backyard with steps leading up to a blue house</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/t2vc/0_a_backyard_with_steps_leading_up_to_a_blue_house_slow.gif" height="300">
                                <div>
                                    correspondences from T2V+C 
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/t2vv2c/0_03_slow.gif" height="300">
                                <div>
                                    correspondences from T2V->V2C
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/t2vc/0_a_backyard_with_steps_leading_up_to_a_blue_house_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V+C
                                </div>
                            </th>
                            <th>
                                <img src="assets/t2vv2c/0_03_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V->V2C
                                </div>
                            </th>
                        </tr>
                      </table>
                </details>
                <details>
                    <summary>a hallway leading to a bathroom and bedroom</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/t2vc/01_slow.gif" height="300">
                                <div>
                                    correspondences from T2V+C 
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/t2vv2c/0_a_hallway_leading_to_a_bathroom_and_bedroom_slow.gif" height="300">
                                <div>
                                    correspondences from T2V->V2C
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/t2vc/01_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V+C
                                </div>
                            </th>
                            <th>
                                <img src="assets/t2vv2c/0_a_hallway_leading_to_a_bathroom_and_bedroom_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V->V2C
                                </div>
                            </th>
                        </tr>
                      </table>
                </details>
                <details>
                    <summary>an aerial view of a large house on the water</summary>
                    <table style="width: 100%;margin-left:auto;margin-right:auto;">
                        <tr>
                            <th>
                                <img src="assets/t2vc/02_slow.gif" height="300">
                                <div>
                                    correspondences from T2V+C 
                                </div>
                            </th>
                            
                            <th>
                                <img src="assets/t2vv2c/0_an_aerial_view_of_a_large_house_on_the_water_slow.gif" height="300">
                                <div>
                                    correspondences from T2V->V2C
                                </div>
                            </th>
                        </tr>
                        <tr>
                            <th>
                                <img src="assets/t2vc/02_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V+C
                                </div>
                            </th>
                            <th>
                                <img src="assets/t2vv2c/0_an_aerial_view_of_a_large_house_on_the_water_trajectory.jpg" height="350">
                                <div>
                                    camera poses from T2V->V2C
                                </div>
                            </th>
                        </tr>
                      </table>
                </details>
            </details>
        </div>
    </div>
    <script type="text/javascript" src="script.js"></script>
    <!-- <script type="text/javascript" src="cocoen.js"></script> -->
    <!-- <script>
      Cocoen.parse(document.body);
    </script> -->
</body>