Keywords: latent video models, novel view synthesis
TL;DR: We propose a method for generating fly-through videos of a scene from a single image and a camera trajectory, by introducing 3D camera control into latent video diffusion models.
Abstract: We propose a method for generating fly-through videos of a scene from a single image and a given camera trajectory.
We build upon an image-to-video latent diffusion model. We condition its UNet denoiser on the camera trajectory, using four techniques.
(1) We condition the UNet's temporal blocks on raw camera extrinsics, similar to MotionCtrl.
(2) We condition on images of per-pixel camera ray parameters, similar to CameraCtrl.
(3) We re-project the initial image to subsequent frames and condition on the resulting video.
(4) We introduce a global 3D representation using 2D$\Leftrightarrow$3D transformers, which implicitly conditions on the camera poses.
We combine all conditions in a ControlNet-style architecture. We then propose a metric that evaluates overall video quality and the ability to preserve details under view changes, which we use to analyze the trade-offs of individual and combined conditions, and we identify an optimal combination. We calibrate camera positions in our datasets for scale consistency across scenes and train our scene-exploration model, CameraCtrl3D, demonstrating state-of-the-art results.
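As a concrete illustration of condition (2), the sketch below shows one common way to build per-pixel camera ray-parameter images (Plücker-style ray embeddings) from pinhole intrinsics and camera-to-world extrinsics. This is an assumption about the parameterization, not the paper's exact implementation; the function name and channel layout are hypothetical.

```python
import numpy as np

def ray_parameter_image(K, cam_to_world, height, width):
    """Per-pixel Plücker ray embedding (6 channels: direction, moment).

    A sketch assuming a pinhole camera model:
      K            : (3, 3) intrinsics matrix.
      cam_to_world : (4, 4) camera-to-world extrinsic matrix.
    Returns a (height, width, 6) array usable as a conditioning image.
    """
    # Pixel-center grid in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)           # (H, W, 3)

    # Back-project to camera-space ray directions, then rotate to world space.
    dirs_cam = pix @ np.linalg.inv(K).T                        # (H, W, 3)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)       # unit directions

    # Plücker moment: cross product of the camera origin with each direction.
    moment = np.cross(np.broadcast_to(t, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1).astype(np.float32)
```

Computing one such image per frame of the trajectory yields a video-shaped conditioning signal that can be concatenated with the denoiser's input channels.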
Example videos generated by CameraCtrl3D are available at https://camctrl3d.github.io/
Submission Number: 389