JOG3R: Camera Pose Estimation
Emerging In Video Diffusion Transformer

ICLR 2025
SUBMISSION

This HTML document shows video results and 3D camera reconstruction results. Videos should play automatically in a loop. The page was tested in the Chrome browser on a 2K-resolution display. If the page appears cut off, please zoom out in the browser.

V2C: JOG3R (ours) vs. ours w/o generation loss vs. DUSt3R

Our method produces more accurate correspondences and camera trajectories than DUSt3R.
We also compare against a variant of our method without the generation loss.
For each pair of frames, we visualize only 10 correspondences to avoid clutter.
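As a rough illustration of the subsampling described above, the sketch below keeps at most 10 matches from a larger set. The page does not state how the 10 correspondences are chosen, so uniform random sampling with a fixed seed is an assumption here, and `subsample_matches` and the `Match` type are hypothetical names introduced only for this example.

```python
import random
from typing import List, Sequence, Tuple

# A match pairs a pixel in frame A with a pixel in frame B: ((x1, y1), (x2, y2)).
# This representation is an assumption made for the example.
Match = Tuple[Tuple[float, float], Tuple[float, float]]


def subsample_matches(
    matches: Sequence[Match], k: int = 10, seed: int = 0
) -> List[Match]:
    """Return at most k matches, sampled uniformly without replacement.

    Assumption: uniform random selection; the actual selection rule used
    for the visualizations is not specified on the page.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if len(matches) <= k:
        return list(matches)
    rng = random.Random(seed)  # fixed seed so the same lines are drawn each run
    chosen = sorted(rng.sample(range(len(matches)), k))  # keep original order
    return [matches[i] for i in chosen]


if __name__ == "__main__":
    # Synthetic matches: each point shifts one pixel to the right.
    demo = [((float(i), 0.0), (float(i) + 1.0, 0.0)) for i in range(100)]
    for (x1, y1), (x2, y2) in subsample_matches(demo):
        print(f"({x1}, {y1}) -> ({x2}, {y2})")
```

Fixing the seed keeps the drawn lines stable across frame pairs and reruns, which makes side-by-side comparisons easier to read.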

da80d87326bf63b7 (JOG3R vs. DUSt3R)
our correspondences
DUSt3R's correspondences
(note the drift in the second-to-last line)
our camera trajectory
DUSt3R's camera trajectory
fb52f951d8a8ad11 (JOG3R vs. DUSt3R)
our correspondences
DUSt3R's correspondences
our camera trajectory
DUSt3R's camera trajectory
26fe74c70177d694 (JOG3R vs. DUSt3R)
our correspondences
DUSt3R's correspondences
our camera trajectory
(camera moves only rightwards)
DUSt3R's camera trajectory
(camera jitters around)
e0577a912fd116ea (JOG3R vs. DUSt3R)
our correspondences
DUSt3R's correspondences
our camera trajectory
DUSt3R's camera trajectory (doesn't move horizontally)
1de1b73fe4d6aa77 (JOG3R vs. JOG3R w/o generation loss)
ours
ours w/o gen loss
our camera trajectory
ours w/o gen loss
(camera moves in a straight line, with no curved trajectory;
the green camera has a sudden jump)
d48b66d36ec83707 (JOG3R vs. JOG3R w/o generation loss)
ours
ours w/o gen loss
our camera trajectory
(camera moves only forward)
ours w/o gen loss
(camera jitters back and forth)
T2V: JOG3R (ours) vs. ours w/o reconstruction loss

We compare with a variant trained w/o reconstruction loss and show that the reconstruction loss helps generation.

an empty basement with wood paneling on the walls
ours
ours w/o reconstruction loss
(quality degradation)
an outdoor swimming pool surrounded by rocks and lounge chairs
ours
ours w/o reconstruction loss
(noticeable artifacts, no camera motion)
a dining room table with chairs and a vase of flowers
ours
ours w/o reconstruction loss
(left chair has artifacts)
a living room with a couch, coffee table, and entertainment center
ours
ours w/o reconstruction loss
(deforming artifacts appearing on the left at the end)
a laundry room with a washer and dryer in it
ours
ours w/o reconstruction loss
(implausible washing machine configuration)
T2V+C

All videos in this section are generated from JOG3R.
Our T2V+C pipeline reconstructs 3D cameras that are consistent with those from T2V->V2C.
For each pair of frames, we visualize only 10 correspondences to avoid clutter.

a living room with leather chairs and guitars
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C
a backyard with steps leading up to a blue house
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C
a hallway leading to a bathroom and bedroom
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C
an aerial view of a large house on the water
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C