CamI2V: Camera-Controlled Image-to-Video Diffusion Model
*Under Review
More Visualization
*Generated by the 512x320 model (50k training steps); compatible with input images of arbitrary aspect ratio.
Pan Left
Pan Right
Pan Up
Pan Down
Look Left
Look Right
Orbit Left
Orbit Right
Zoom In & Rotate
Pan Left & Zoom
Forward → Backward
Walking
Visualization (512x320)
*Original outputs from 512x320 model, no padding removed.
Visualization (256x256)
Orbit Left
Orbit Right
Zoom In
Zoom Out
More Ablation
CamI2V (Ours)
CamI2V - 3D full attention
CamI2V - epipolar attention only on reference frame (similar to CamCo)
CameraCtrl
MotionCtrl
Thanks to direct cross-frame interactions (epipolar attention or 3D full attention), CamI2V
and the 3D full attention variant succeed in panning right under a large camera movement, while CameraCtrl and MotionCtrl
fail. However, the 3D full attention result shows some blur and color shift on its left side. This is because
3D full attention has access to all the noisy features (noisy condition) across frames, which leads it to absorb incorrect colors.
Applying epipolar attention only to the reference frame (CamCo-like) also fails, for two reasons: access to the noisy condition
is limited (newly appearing pixels have no intersections on the reference frame), and excessive copying from the reference image
leads to a static scene.
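
To make the geometric idea behind epipolar attention concrete, the following is a minimal NumPy sketch, not the CamI2V implementation. It builds a boolean mask that lets each query pixel in one frame attend only to key pixels in another frame that lie near its epipolar line. The function names, the pixel threshold, and the pose convention (a point maps from frame i to frame j as `X_j = R @ X_i + t`) are all assumptions made for illustration.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def pixel_grid(h, w):
    """Homogeneous pixel centers in row-major order, shape (h*w, 3)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel() + 0.5, ys.ravel() + 0.5, np.ones(h * w)], axis=1)

def epipolar_mask(K, R, t, h, w, threshold=0.5):
    """Hypothetical helper: mask[q, k] is True when key pixel k (frame j)
    lies within `threshold` pixels of the epipolar line of query pixel q (frame i).

    Assumes both frames share the intrinsics K and that X_j = R @ X_i + t.
    """
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ skew(t) @ R @ K_inv       # fundamental matrix, frame i -> frame j
    pix = pixel_grid(h, w)                  # (N, 3)
    lines = pix @ F.T                       # (N, 3): one line (a, b, c) per query pixel
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # Point-to-line distance |a*x + b*y + c| / sqrt(a^2 + b^2) for every (query, key) pair.
    # With zero translation the lines vanish and every key passes, i.e. full attention.
    dist = np.abs(lines @ pix.T) / np.maximum(norm, 1e-8)
    return dist < threshold

# Example: a small feature map and a camera panning right (pure x translation).
K = np.array([[8.0, 0.0, 3.0],
              [0.0, 8.0, 2.0],
              [0.0, 0.0, 1.0]])
mask = epipolar_mask(K, np.eye(3), np.array([-0.2, 0.0, 0.0]), h=4, w=6)
```

For a pure horizontal pan the epipolar lines are horizontal, so each query pixel is restricted to the keys on its own row of the other frame. This is only the masking geometry; how the mask is combined with noisy features across frames is specific to each method compared above.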