Keywords: Video Generation, Camera Control, Efficiency
Abstract: Controlling camera motion in video diffusion models is highly sought after for content creation, yet remains a significant challenge.
Recent approaches often create anchor videos,
i.e., videos rendered from estimated point clouds along desired camera trajectories,
which approximate the target camera motion
and serve as a structured prior to guide diffusion models.
However, errors in point cloud and camera trajectory estimation often produce misaligned anchor videos during training. These inherent errors also raise training cost and reduce efficiency, since the model is forced to compensate for the rendering misalignments.
To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework
that constructs well-aligned training anchor videos
without the need for camera pose or point cloud estimation.
Concretely, we create highly precise anchor videos by masking source videos based on first-frame visibility.
This approach ensures strong alignment, eliminates the need for camera/point cloud estimation, and thus can be readily applied to any in-the-wild video
to generate image-to-video (I2V) training pairs.
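To make the idea concrete, here is a minimal, hypothetical sketch of such a first-frame-visibility masking step in Python. It assumes per-frame visibility masks (e.g., from point tracking or flow-based occlusion reasoning) are precomputed; the function name and tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch

def build_anchor_video(video: torch.Tensor, visibility: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: construct an anchor video by masking out
    content that is not visible in the first frame.

    video:      (T, C, H, W) source clip
    visibility: (T, 1, H, W) per-frame mask, 1 where a pixel's content
                is visible in frame 0 (assumed precomputed elsewhere,
                e.g. via point tracking or flow-based occlusion checks)
    """
    anchor = video * visibility   # keep only first-frame-visible content
    anchor[0] = video[0]          # frame 0 is fully visible by definition
    return anchor
```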
Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that injects anchor-video guidance in visible regions into pretrained video diffusion models, using fewer than 1% of the backbone's parameters. By combining the proposed anchor video data and ControlNet module, EPiC trains efficiently with substantially fewer parameters, training steps, and data, and requires no modifications to the diffusion model backbone.
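Below is a hedged sketch of what such a lightweight ControlNet-style adapter could look like. The class name, layer sizes, and injection scheme are assumptions for illustration (zero-initialized residual injection is the standard ControlNet device); this is not the paper's actual module.

```python
import torch
import torch.nn as nn

class AnchorControlNet(nn.Module):
    """Hypothetical sketch of a lightweight ControlNet-style adapter:
    encodes anchor-video latents and injects zero-initialized residuals
    into frozen backbone features. Shapes and sizes are illustrative."""

    def __init__(self, latent_dim: int = 16, hidden: int = 128, n_layers: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(latent_dim, hidden, 3, padding=1), nn.SiLU(),
            *[m for _ in range(n_layers)
                for m in (nn.Conv3d(hidden, hidden, 3, padding=1), nn.SiLU())],
        )
        # Zero-init the output projection so training starts from the
        # unmodified backbone and guidance is learned gradually.
        self.zero_proj = nn.Conv3d(hidden, latent_dim, 1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, backbone_feat: torch.Tensor,
                anchor_latent: torch.Tensor,
                visibility: torch.Tensor) -> torch.Tensor:
        # backbone_feat, anchor_latent: (B, latent_dim, T, H, W)
        # visibility:                   (B, 1, T, H, W)
        residual = self.zero_proj(self.encoder(anchor_latent))
        # Apply guidance only in the visible regions of the anchor video.
        return backbone_feat + residual * visibility
```

Because the adapter only adds masked residuals, the frozen backbone's behavior is preserved in regions the anchor video leaves unconstrained.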
Although trained only on masking-based anchor videos, our method generalizes robustly at test time to anchor videos rendered from point clouds, enabling precise 3D-informed camera control.
EPiC achieves state-of-the-art performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control both quantitatively and qualitatively.
Notably, EPiC also exhibits strong zero-shot generalization to video-to-video (V2V) scenarios. This is striking because the model is trained exclusively on I2V data, where anchor videos are derived using only the source video's first frame as the visibility reference.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13832