Abstract: Synthesizing 3D content from a single image has great potential in many real-world applications. To deal with the inherent ambiguity of a single image, existing methods usually leverage pre-trained 2D diffusion models for computationally intensive per-instance optimization. Although recent 3D foundation models can create 3D assets in a feed-forward manner, their efficacy is still limited because they neglect geometric cues from images. To address this issue, we propose an efficient 3D foundation model named LPM to synthesize 3D content from an image. Analogous to masked modeling in the 2D image domain, the key to our approach is to learn 3D representations from incomplete visible shapes. By taking advantage of a synthesis-by-analysis paradigm, we establish an efficient pipeline that first estimates the visible portions and then generates the complete 3D representation. Based on the principle that an image is the projection of a 3D model, we initially estimate partial 3D voxel features from the single image, which are then projected onto orthogonal planes to form an incomplete yet efficient triplane representation. Subsequently, an autoencoder is employed to model a complete triplane representation from the incomplete parts. We train our model, which has over 250 million parameters, on massive data to enhance its generalization capability. The experimental results show that LPM can generate high-fidelity 3D objects from an image within 0.1 seconds, which is more effective than existing feed-forward approaches. Our implementation and pre-trained models will be made publicly available.
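The voxel-to-triplane step described above can be illustrated with a minimal sketch. The abstract does not specify the projection operator, so this example assumes average pooling along each spatial axis; the helper name `project_to_triplane` and all shapes are hypothetical, not from the paper.

```python
import numpy as np

def project_to_triplane(voxel_feats: np.ndarray):
    """Project a (C, D, H, W) voxel feature grid onto three orthogonal
    feature planes by average-pooling along each spatial axis.
    (Illustrative assumption; the paper's projection may differ.)"""
    plane_xy = voxel_feats.mean(axis=1)  # pool over depth  -> (C, H, W)
    plane_xz = voxel_feats.mean(axis=2)  # pool over height -> (C, D, W)
    plane_yz = voxel_feats.mean(axis=3)  # pool over width  -> (C, D, H)
    return plane_xy, plane_xz, plane_yz

# Example: a partial voxel grid where unseen regions are zeroed out,
# mimicking the "incomplete visible shape" the abstract describes.
C, D, H, W = 8, 16, 16, 16
voxels = np.random.rand(C, D, H, W).astype(np.float32)
voxels[:, D // 2:] = 0.0  # mask the unobserved back half
planes = project_to_triplane(voxels)
```

In the pipeline the abstract outlines, the three resulting planes would form the incomplete triplane that the autoencoder subsequently completes.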
External IDs: dblp:journals/tcsv/ZhangYWZ25