Abstract: Driven by powerful image diffusion models, recent research has achieved the automatic creation of 3D objects from textual or visual guidance. By performing score distillation sampling (SDS) iteratively across different views, these methods succeed in lifting the 2D generative prior into 3D space.
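For context, the SDS gradient as formulated in DreamFusion (Poole et al., 2022) optimizes the parameters \theta of a rendered image x = g(\theta) via

\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right],

where \hat{\epsilon}_\phi is the diffusion model's noise prediction conditioned on the prompt y and w(t) is a timestep weighting.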
However, such a 2D generative image prior bakes the effects of illumination and shadow into the texture.
As a result, material maps optimized by SDS inevitably contain spurious, illumination-correlated components.
Without a precise material definition, the generated assets cannot be plausibly relit in novel scenes, which limits their use in downstream scenarios. In contrast, humans effortlessly circumvent this ambiguity by deducing an object's material from its appearance and semantics.
Motivated by this insight, we propose MaterialSeg3D, a 3D asset material-generation framework that infers underlying materials from a 2D semantic prior.
Based on this prior model, we devise a mechanism to parse materials in 3D space.
We maintain a UV stack in which each map is unprojected from a specific viewpoint.
After traversing all viewpoints, we fuse the stack through a weighted voting scheme and then apply region unification to ensure coherence across object parts; a minimal sketch of these two steps follows.
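As a concrete illustration, the NumPy sketch below implements one plausible form of the weighted-voting fusion and region unification; the array layout, the -1 not-visible convention, the weighting heuristic, and the helper names fuse_uv_stack and unify_regions are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def fuse_uv_stack(uv_stack, weights, num_materials):
    # uv_stack: (V, H, W) integer material labels, one UV map per viewpoint;
    #           -1 marks UV texels that a given view does not observe (assumed convention).
    # weights:  (V, H, W) per-texel voting weights, e.g. favoring head-on views
    #           (an illustrative heuristic, not the paper's exact scheme).
    votes = np.zeros((num_materials,) + uv_stack.shape[1:], dtype=np.float64)
    for labels, w in zip(uv_stack, weights):
        r, c = np.nonzero(labels >= 0)               # texels covered by this view
        np.add.at(votes, (labels[r, c], r, c), w[r, c])
    # Texels unseen by every view fall back to label 0 here.
    return votes.argmax(axis=0)                      # winning material per UV texel

def unify_regions(material_map, region_map):
    # Region unification: assign every texel of an object-part region the region's
    # majority material label, so each part carries one coherent material.
    unified = material_map.copy()
    for region_id in np.unique(region_map):
        mask = region_map == region_id
        labels, counts = np.unique(material_map[mask], return_counts=True)
        unified[mask] = labels[counts.argmax()]
    return unified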
To fuel the learning of the semantic prior, we collect a material dataset, named Materialized Individual Objects (MIO), which features abundant images, diverse categories, and accurate annotations.
Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Social Aspects of Generative AI
Relevance To Conference: We resort to a 2D segmentation model and collect a dataset with material annotations to fuel the learning of the semantic prior.
The dataset, termed Materialized Individual Objects (MIO), is a collection of images spanning typical to extreme viewpoints and ranging from virtual renderings to real-world photographs.
Based on such a prior model, we devise a mechanism for material parsing.
Rather than refreshing a single UV map as existing texture-generation methods do, we maintain a UV stack in which each map is unprojected from an observation view.
We fuse the stack through a weighted voting scheme and then apply region unification to ensure coherence across object parts.
Extensive experiments and visualizations demonstrate the effectiveness of our method.
Supplementary Material: zip
Submission Number: 179