Keywords: image-to-3D, multiview diffusion, pointmaps
TL;DR: Direct image-to-3D via view synthesis using a novel pointmap representation
Abstract: We present unPIC, a method for generating novel 3D-consistent views of an object from a single image. Given one input view, unPIC produces a full spin of the object around its vertical axis, a process that is typically a precursor to reconstructing the object in 3D.
Our key idea is to predict the object's underlying 3D geometry from the input image _before_ predicting the textured appearance of the novel views. To this end, unPIC consists of two modules: a multiview geometry _prior_, followed by a multiview appearance _decoder_, both implemented as diffusion models but trained separately. During inference, the geometry serves as a blueprint to coordinate the generation of the final novel views, thus enforcing consistency across the object's 360-degree spin. We introduce a novel pointmap-based representation to capture the geometry, with one key advantage: it allows us to obtain a 3D point cloud directly as part of the view-synthesis process, rather than as a post-hoc step.
Our modular, geometry-driven framework outperforms leading methods such as InstantMesh, EscherNet, CAT3D, and Direct3D on novel-view quality, geometric accuracy, and multiview-consistency metrics. Furthermore, unPIC generalizes well to challenging real-world captures from datasets such as Google Scanned Objects and the Digital Twin Catalog.
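The sketch below is a minimal, hypothetical illustration of the two-stage inference described in the abstract; `geometry_prior`, `appearance_decoder`, and their `sample` methods are assumed placeholder names, not the authors' released API. It also shows why a pointmap representation (per-pixel XYZ in a shared frame) yields a point cloud directly: extracting the cloud is just a reshape and concatenate, with no separate reconstruction step.

```python
import numpy as np

def infer_novel_views(input_image, geometry_prior, appearance_decoder, num_views=16):
    """Hypothetical two-stage inference: geometry first, then appearance."""
    # Stage 1: the multiview geometry prior samples pointmaps,
    # shape (V, H, W, 3), where each pixel stores a 3D coordinate.
    pointmaps = geometry_prior.sample(input_image, num_views=num_views)
    # Stage 2: the appearance decoder renders textured novel views,
    # conditioned on the input image and the sampled geometry.
    views = appearance_decoder.sample(input_image, pointmaps)
    return views, pointmaps

def pointmaps_to_point_cloud(pointmaps, foreground_masks=None):
    """Fuse multiview pointmaps into a single (N, 3) point cloud.

    Because a pointmap already stores XYZ per pixel, no post-hoc
    reconstruction is needed: just flatten, mask, and concatenate.
    """
    clouds = []
    for v, pm in enumerate(pointmaps):
        xyz = pm.reshape(-1, 3)
        if foreground_masks is not None:
            xyz = xyz[foreground_masks[v].reshape(-1)]
        clouds.append(xyz)
    return np.concatenate(clouds, axis=0)
```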
Supplementary Material: zip
Primary Area: generative models
Submission Number: 4708