Abstract: Generating 3D shapes according to specific textual input is a crucial topic in the multimedia application, with its potential enhancement to the VR/AR/XR usage that enables more diverse virtual scenes. Due to the recent success of diffusion models, text-guided 3D object generation has drawn a lot of attention recently. However, current latent diffusion-based methods are restricted to shape-only generation, requiring time-consuming and computationally expensive post-processing to obtain colored objects. In this paper, we propose an end-to-end Shape-Color Diffusion Prior framework (SCDiff) to achieve colored text-to-3D object generation. Given a general text description as input, our SCDiff is able to distinguish shape and color-related priors in the text and generate a shape latent and a color latent for a pre-trained 3D object auto-encoder to derive colored 3D objects. Our SCDiff contains two 3D latent diffusion models (LDM), where one generates the shape latent from the input text and the other generates the color latent. To help the two LDMs focus on shape/color-related information, we further adopt a Large Language Model (LLM) to separate the input text into a shape phrase and a color phrase via an in-context learning technique so that our shape/color LDM would not be influenced by irrelevant information. Due to the separation of shape and color latent, we are able to manipulate the color of an object by giving different color phrases while maintaining the original shape. Experiments on a benchmark dataset would quantitatively and qualitatively verify the effectiveness and practicality of our proposed model. As an extension, we show the capability of our SCDiff on 3D object generation and manipulation based on various modality conditions, which further confirms the scalability and applications in multimedia of our proposed framework.
External IDs:dblp:journals/tmm/HuangHCCLW25
Loading