CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation



 

Figure A: Comparison with the concurrent work CameraCtrl.
Our results remain consistent throughout the video.
CameraCtrl's output contains obvious flickering artifacts and distortions.

CameraCtrl
Ours
Reference Video

A blue chair on a carpet in a living room.

Majestic mountains with an eagle gliding effortlessly through the sky.

 

Figure B: 3D Reconstruction Results of CamCo's Generated Videos.
We provide novel-view renderings of the 3D scenes reconstructed from CamCo's output videos. This level of 3D consistency across video frames is hard to achieve with previous methods.

Generated Videos (Object-Centric)

Reconstructed 3D Scenes (Object-Centric)

Generated Videos (Indoor and Outdoor Scenes)

Reconstructed 3D Scenes (Indoor and Outdoor Scenes)

 

Figure C: CamCo Evaluated on In-the-Wild Images.
We provide additional results of CamCo evaluated on in-the-wild images.

 

Figure D: More Dynamic Results and Comparisons.
We provide additional dynamic results from CamCo and MotionCtrl.
We invite reviewers to compare them side by side.
MotionCtrl tends to produce static results with little to no object motion.

MotionCtrl
Ours
 

Figure E: Qualitative Comparisons for Ablation Studies.
We provide qualitative comparisons for our ablation studies.
We invite reviewers to compare the results side by side.

(a) Effectiveness of our proposed Epipolar Constraint Attention (ECA).
The floor texture is well preserved at novel viewpoints when ECA is present.
CameraCtrl's output shows severe inconsistency between frames.

CameraCtrl
Without ECA
With ECA

A view of a living room from an open doorway.

(b) Effectiveness of our proposed dynamic data curation pipeline.
Without the proposed data curation pipeline, camera pose estimates are noisy, and the video generator cannot produce object motion and camera motion together.
MotionCtrl, by comparison, can only generate static scenes (the water does not move).

MotionCtrl
Without Curation
With Curation

A view of a living room from an open doorway.

 

Figure F: Static Generation Comparison (RealEstate-10k)

Stable Video Diffusion
MotionCtrl
Ours
Reference Video
 

Figure G: Dynamic Generation Comparison (T2I)

Stable Video Diffusion
MotionCtrl
Ours
Reference Video

A cozy campfire in the woods at night, with logs burning brightly, sparks flying, and people sitting around it.

A waterfall surrounded by trees with bright red, orange, and yellow leaves, with fallen leaves floating on the water.

A mountain peak surrounded by forests in vibrant autumn colors, with a clear sky overhead.

Majestic mountains with an eagle gliding effortlessly through the sky.