Figure A. Monocular Video Comparisons. Our results exhibit strong geometric
consistency and perceptual quality. MotionCtrl tends to generate simplified
trajectories, while CameraCtrl's output contains severe artifacts and
distortion.
Panels: SVD, MotionCtrl, CameraCtrl, Ours, Reference.
Figure B. 2-View Video Comparisons. Our results show better object
motion compared to other methods.
Panels: Camera, MotionCtrl, CameraCtrl, Ours.
Figure C. 4-View Video Comparisons. Our results show better geometric
consistency than those from the concurrent work CVD, which tends to generate
artifacts in border regions. Results from CVD and the input image are taken from
the CVD website. The camera sequences used for inference with our model pan
along straight lines without rotation.
Panels: Input, CVD Camera, CVD Result, Our Camera, Our Result.
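The straight-line, rotation-free camera path described in the Figure C caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `straight_line_pan`, the camera-to-world 4x4 convention, and the start and end positions are hypothetical and are not taken from the Cavia implementation.

```python
def straight_line_pan(start, end, num_frames):
    """Build camera-to-world 4x4 poses that translate linearly from
    `start` to `end` while keeping the rotation fixed at identity.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # interpolation weight in [0, 1]
        x, y, z = (s + t * (e - s) for s, e in zip(start, end))
        poses.append([
            [1.0, 0.0, 0.0, x],  # identity rotation block: no rotation
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0],
        ])
    return poses


# Example: a five-frame pan of one unit along the x axis (assumed values).
pan_poses = straight_line_pan((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 5)
```

Because only the translation column changes between frames, any difference between the generated views comes from camera position alone, which is the setting the caption describes.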
Figure D. 3D Reconstruction Comparisons. Our generated frames can be
reconstructed into 3D scenes with high perceptual quality. In comparison, CVD's
reconstruction result contains floaters and blurry artifacts. Results from CVD are
taken from their website.
Panels: CVD, Ours.
Figure E. Visualization of Ablation Studies.
Ablation on cross-frame attention.
Panels: w/o Plücker, w/o cross-frame attention, Full Model, Reference (original 27 frames).
Ablation on cross-view attention.
Panels: w/o cross-view attention, with cross-view attention.
Ablation on monocular joint training.
Panels: w/o monocular joint training, with monocular joint training.
Figure F. Additional 2-View Results from Cavia.
Figure G. Additional 4-View Results from Cavia.
Figure H. Additional 3D Reconstruction Results from Cavia. We render the reconstructed 3D Gaussians along an elliptical novel-view trajectory.
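As a rough illustration of an elliptical novel-view trajectory like the one mentioned in Figure H, the sketch below samples camera positions on an ellipse. The function name `elliptical_trajectory`, the choice of the x-y plane, and the radii are all hypothetical; the caption does not specify the ellipse parameters or the renderer used.

```python
import math


def elliptical_trajectory(center, radius_x, radius_y, num_views):
    """Sample `num_views` camera positions on an ellipse in the x-y plane
    around `center`, at evenly spaced parameter angles.

    Hypothetical helper for illustration only; not the authors' code.
    """
    positions = []
    for i in range(num_views):
        theta = 2.0 * math.pi * i / num_views  # angle along the ellipse
        positions.append((
            center[0] + radius_x * math.cos(theta),
            center[1] + radius_y * math.sin(theta),
            center[2],  # depth is held fixed in this sketch
        ))
    return positions


# Example: 60 viewpoints on a small ellipse around the origin (assumed values).
views = elliptical_trajectory((0.0, 0.0, 0.0), 0.2, 0.1, 60)
```

Each sampled position would then be paired with a viewing direction and passed to a Gaussian splatting renderer, which is outside the scope of this sketch.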