Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention



Figure A. Monocular Video Comparisons.
Our results showcase outstanding geometric consistency and perceptual quality.
MotionCtrl tends to generate simplified trajectories,
while CameraCtrl's output contains severe artifacts and distortion.

SVD
MotionCtrl
CameraCtrl
Ours
Reference

Figure B. 2-View Video Comparisons.
Our results show better object motion compared to other methods.

Camera
MotionCtrl
CameraCtrl
Ours

Figure C. 4-View Video Comparisons.
Our results show better geometric consistency compared to those from the concurrent work CVD.
In comparison, CVD tends to generate artifacts in border regions.
Results from CVD and the input image are taken from their website.
The camera sequences used for inference with our model pan along straight lines without rotation.

Input
CVD Camera
CVD Result
Our Camera
Our Result
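For readers who want to reproduce a similar camera setup, the straight-line, rotation-free pan described in the Figure C caption can be sketched as below. This is only an illustrative sketch: the function name `straight_pan_extrinsics`, the 3x4 [R|t] pose layout, and the frame count and distance in the usage lines are assumptions, not values taken from Cavia.

```python
def straight_pan_extrinsics(num_frames, direction, total_distance):
    """Build per-frame 3x4 [R|t] poses for a pan along a straight line.

    Hypothetical helper: rotation is fixed to the identity (no rotation),
    and the translation moves linearly along `direction`.
    """
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]

    poses = []
    for i in range(num_frames):
        # Fraction of the path covered at frame i (0 at the first frame,
        # total_distance at the last one).
        s = total_distance * i / (num_frames - 1) if num_frames > 1 else 0.0
        t = [s * u for u in unit]
        poses.append([
            [1.0, 0.0, 0.0, t[0]],
            [0.0, 1.0, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
        ])
    return poses


# Example: a 5-frame pan of 2 units along the x axis (assumed values).
poses = straight_pan_extrinsics(5, (1.0, 0.0, 0.0), 2.0)
```

Giving each view its own `direction` would produce one such sequence per view, all sharing the same starting pose.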

Figure D. 3D Reconstruction Comparisons.
Our generated frames can be reconstructed into 3D scenes with high perceptual quality.
In comparison, CVD's reconstruction result contains floaters and blurry artifacts.
Results from CVD are taken from their website.

CVD
Ours

Figure E. Visualization of Ablation Studies.

Ablation on cross-frame attention.

w/o Plücker
w/o cross-frame attention
Full Model
Reference (original 27 frames)

Ablation on cross-view attention.

w/o cross-view attention
with cross-view attention

Ablation on monocular joint training.

w/o monocular joint training
with monocular joint training

Figure F. Additional 2-View Results from Cavia.


Figure G. Additional 4-View Results from Cavia.


Figure H. Additional 3D Reconstruction Results from Cavia.
We render the reconstructed 3D Gaussians along an elliptical novel-view trajectory.
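
An elliptical trajectory of this kind can be sketched as follows. The page does not give the ellipse parameters or the renderer, so the function name `elliptical_trajectory`, the semi-axes, the center, and the choice to keep the ellipse in the x-y plane are all assumptions for illustration.

```python
import math


def elliptical_trajectory(num_views, semi_axis_x, semi_axis_y,
                          center=(0.0, 0.0, 0.0)):
    """Sample camera positions evenly (in angle) on an ellipse.

    Hypothetical helper: the ellipse lies in the x-y plane around `center`,
    and the depth coordinate is held constant.
    """
    positions = []
    for i in range(num_views):
        theta = 2.0 * math.pi * i / num_views
        positions.append((
            center[0] + semi_axis_x * math.cos(theta),
            center[1] + semi_axis_y * math.sin(theta),
            center[2],
        ))
    return positions


# Example: 60 novel views on an ellipse with assumed semi-axes 0.5 and 0.25.
views = elliptical_trajectory(60, 0.5, 0.25)
```

Each sampled position would then be paired with a viewing direction (for example, looking at the scene center) before rendering the reconstructed Gaussians.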