Abstract: The scene image is an important medium for showcasing product design. To obtain one, the standard 3D-based pipeline requires the designer not only to create the 3D model of the product but also to manually construct the entire scene in software, which hinders its adaptability in situations that require rapid evaluation. This study aims to realize a novel conditional synthesis method that creates the scene image from a single-model rendering of the desired object and a scene description. The major challenges of this task are ensuring strict appearance fidelity of the drawn object and the overall visual harmony of the synthesized image. The former relies on maintaining an appropriate condition-output constraint, while the latter necessitates a well-balanced generation process across all regions of the image. In this work, we propose the Scene Diffusion framework to meet these challenges. Its first contribution is Shading Adaptive Condition Alignment (SACA), an intensive training objective that promotes appearance consistency between the condition and the output image without hindering the network's learning of global shading coherence. Furthermore, a novel low-to-high Frequency Progression Training Schedule (FPTS) is used to maintain the visual harmony of the entire image by moderating the growth of high-frequency signals in the object area. Extensive qualitative and quantitative results are presented to demonstrate the advantages of the proposed method. We also show broader uses of Scene Diffusion, such as its integration with ControlNet.
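The abstract does not specify the exact form of FPTS, so the following PyTorch sketch is purely illustrative: it shows one plausible way a low-to-high frequency progression could moderate high-frequency signals in the object area during diffusion training. The function names (`fpts_loss`, `lowpass_mask`), the `object_mask` input, and the linear cutoff schedule are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only; not the paper's actual FPTS implementation.
import torch
import torch.fft


def lowpass_mask(h, w, cutoff, device):
    """Binary radial low-pass mask in the 2D Fourier domain (cutoff in [0, 1])."""
    fy = torch.fft.fftfreq(h, device=device).view(h, 1)
    fx = torch.fft.fftfreq(w, device=device).view(1, w)
    radius = torch.sqrt(fx ** 2 + fy ** 2)  # normalized spatial frequency
    return (radius <= cutoff * radius.max()).float()


def fpts_loss(pred, target, object_mask, step, total_steps):
    """Diffusion MSE loss with an assumed low-to-high frequency progression
    applied inside the object region.

    pred, target: (B, C, H, W) network prediction and its regression target.
    object_mask:  (B, 1, H, W), 1 inside the conditioned object area, 0 elsewhere.
    step / total_steps: training progress used to schedule the frequency cutoff.
    """
    _, _, h, w = target.shape
    # Cutoff grows linearly, admitting only low frequencies early in training.
    cutoff = 0.1 + 0.9 * (step / max(total_steps, 1))
    mask = lowpass_mask(h, w, cutoff, target.device)

    def band_limit(x):
        spec = torch.fft.fft2(x)
        return torch.fft.ifft2(spec * mask).real

    # Object region: compare band-limited signals; background: full-band loss.
    obj_loss = (band_limit(pred) - band_limit(target)) ** 2
    bg_loss = (pred - target) ** 2
    loss = object_mask * obj_loss + (1.0 - object_mask) * bg_loss
    return loss.mean()
```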
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Text-driven image synthesis has garnered significant interest in the field of multimedia over the past few years. Its effectiveness and ease of use have also made it a favored tool in the design industry. In this work, we aim to realize a text-driven image synthesis method that assists designers in creating scene images of their products solely from a single 3D model and a scene description text. After creating the 3D product model, the designer only needs to provide a single-model rendering in the desired position and posture; the network then generates a complete scene image from the rendered image and the scene description. In contrast to the traditional 3D-based pipeline, this framework eliminates the laborious scene-construction step and is expected to significantly increase the efficiency of multimedia design.
Supplementary Material: zip
Submission Number: 5530