TL;DR: Create 3D scenes from any number of real or generated images.
How it works
Given any number of input images, we use a multi-view diffusion model conditioned on those images to generate novel views of the scene. The resulting views are fed to a robust 3D reconstruction pipeline, producing a 3D representation that can be rendered interactively. Total processing time, including both view generation and 3D reconstruction, can be as little as one minute.
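The two-stage flow described above can be sketched as plain control flow. This is only an illustrative outline, not the CAT3D code: `View`, `create_3d`, `generate_views`, and `reconstruct` are hypothetical names, the two stages are passed in as callables standing in for the diffusion model and the reconstruction pipeline, and passing the observed views to reconstruction alongside the generated ones is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class View:
    image: Any               # pixel data in whatever form the models expect
    pose: Any                # camera pose for this image
    generated: bool = False  # True for views produced by the diffusion model


def create_3d(
    observed: Sequence[View],
    target_poses: Sequence[Any],
    generate_views: Callable[[Sequence[View], Sequence[Any]], List[View]],
    reconstruct: Callable[[List[View]], Any],
) -> Any:
    """Stage 1: generate novel views. Stage 2: reconstruct a 3D representation."""
    if not observed:
        raise ValueError("at least one observed view is required")

    # Stage 1 (hypothetical callable): multi-view generation conditioned on the inputs.
    novel = generate_views(observed, target_poses)
    if len(novel) != len(target_poses):
        raise ValueError("expected one generated view per target pose")

    # Stage 2 (hypothetical callable): 3D reconstruction.
    # Assumption: observed and generated views are used together.
    return reconstruct(list(observed) + list(novel))
```

Keeping the stages as injected callables makes the outline independent of any particular model or reconstruction backend.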
Comparisons to other methods
Compare the renders and depth maps of our method CAT3D (right) with baseline methods (left).
CAT3D uses a multi-view latent diffusion model to generate novel views of the scene. This model can be conditioned on any number of observed views (input images with corresponding camera poses embedded as ray coordinates), and is trained to produce multiple consistent novel images of the scene at specified target viewpoints. This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS).
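To make "camera poses embedded as ray coordinates" concrete, here is a minimal sketch of one common way to turn a camera pose into a per-pixel ray map. The exact parameterization used by CAT3D is not stated here, so everything below is an assumption: a pinhole camera looking along +z, rays sampled at pixel centers, a camera-to-world rotation and translation, and six channels per pixel (ray origin followed by unit direction). The function name `ray_map` is hypothetical.

```python
import math


def ray_map(fx, fy, cx, cy, rotation, translation, width, height):
    """Return a height x width grid of 6-tuples (ox, oy, oz, dx, dy, dz).

    rotation is a 3x3 camera-to-world matrix (nested lists) and translation
    is the camera center in world coordinates (assumed conventions).
    """
    origin = tuple(float(c) for c in translation)
    rows = []
    for v in range(height):
        row = []
        for u in range(width):
            # Direction through the pixel center in camera coordinates (+z forward).
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate into world coordinates: d_world = R @ d_cam.
            d_world = [
                sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)
            ]
            norm = math.sqrt(sum(c * c for c in d_world))
            # All rays of a pinhole camera share the camera center as origin.
            row.append(origin + tuple(c / norm for c in d_world))
        rows.append(row)
    return rows
```

A map like this has the same spatial layout as the image, so it can be attached to each view in the way a video model would attach a time embedding to each frame.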