CamI2V: Camera-Controlled Image-to-Video Diffusion Model

*Under Review

More Visualization

*Generated by the 512x320 model (50k training steps), compatible with input images of arbitrary aspect ratio.

Pan Left

Pan Right

Pan Up

Pan Down


Look Left

Look Right

Orbit Left

Orbit Right


Zoom In & Rotate

Pan Left & Zoom

Forward → Backward

Walking

Visualization (512x320)

*Original outputs from 512x320 model, no padding removed.

Visualization (256x256)

Orbit Left

Orbit Right


Zoom In

Zoom Out

More Ablation

CamI2V (Ours)

CamI2V - 3D full attention

CamI2V - epipolar attention
only on reference frame
(similar to CamCo)

CameraCtrl

MotionCtrl


Thanks to direct cross-frame interactions (epipolar attention or 3D full attention), CamI2V and the 3D full attention variant both succeed in panning right under a large camera movement, while CameraCtrl and MotionCtrl fail. However, the 3D full attention result shows some blur or color shift on the left side of the frame. This is because 3D full attention has access to all the noisy features (the noisy condition) across frames, which leads it to absorb incorrect colors. The variant with epipolar attention only on the reference frame (CamCo-like) also fails, for two reasons: its access to the noisy condition is limited (newly appearing pixels have no intersections on the reference frame), and copying too much from the reference image leads to a static scene.
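
To make the difference between the three attention variants concrete, here is a minimal NumPy sketch of how an epipolar attention mask between two frames could be built from camera parameters. It is an illustration of standard epipolar geometry, not the CamI2V implementation: the function names, the pose convention (`X_key = R @ X_query + t`), the pixel-distance `threshold`, and the `mode` strings are all assumptions made for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R, t):
    """Map a query-frame pixel to its epipolar line in the key frame.

    Assumed convention: X_key = R @ X_query + t in camera coordinates,
    with both frames sharing the same intrinsics K.
    """
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def pixel_grid(h, w):
    """Homogeneous pixel centers in row-major order, shape (h*w, 3)."""
    ys, xs = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    return np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)

def epipolar_mask(K, R, t, h, w, threshold):
    """Boolean (h*w, h*w) mask; entry [q, k] is True when key pixel k lies
    within `threshold` pixels of the epipolar line of query pixel q."""
    pts = pixel_grid(h, w)
    lines = pts @ fundamental_matrix(K, R, t).T        # row q holds l_q = F p_q
    num = np.abs(lines @ pts.T)                         # |l_q . p_k|
    denom = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = num / np.maximum(denom, 1e-8)                # point-to-line distance
    return dist < threshold

def frame_key_visibility(num_frames, mode):
    """Boolean (num_frames, num_frames): which key frames each query frame sees.

    "reference_only" (hypothetical name) keeps only frame 0 as a key, as in the
    CamCo-like ablation; any other mode lets every frame attend to every frame.
    """
    if mode == "reference_only":
        vis = np.zeros((num_frames, num_frames), dtype=bool)
        vis[:, 0] = True
        return vis
    return np.ones((num_frames, num_frames), dtype=bool)

# Usage: a pure sideways camera translation gives horizontal epipolar lines,
# so each query pixel is restricted to the keys in its own image row.
K = np.array([[10.0, 0.0, 3.0], [0.0, 10.0, 2.0], [0.0, 0.0, 1.0]])
mask = epipolar_mask(K, np.eye(3), np.array([1.0, 0.0, 0.0]), h=4, w=6, threshold=0.5)
print(mask.sum(axis=1))  # number of keys each query pixel may attend to
```

In this picture, 3D full attention corresponds to dropping the pixel-level mask entirely (every query sees every noisy feature in every frame), while the reference-only variant keeps the epipolar constraint but exposes only the first frame as keys, so pixels that newly enter the view have no epipolar line to intersect there.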