CustomNet: Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Incorporating a customized object into image generation is an attractive feature in text-to-image (T2I) generation. Some methods finetune T2I models for each object individually at test time, which tends to overfit and is time-consuming. Others train an extra encoder to extract object visual information for efficient customization but struggle to preserve the object's identity. To address these limitations, we present CustomNet, a unified encoder-based object customization framework that explicitly incorporates 3D novel view synthesis capabilities into the customization process. This integration facilitates the adjustment of spatial positions and viewpoints, producing diverse outputs while effectively preserving the object's identity. To train our model effectively, we propose a dataset construction pipeline that better handles real-world objects and complex backgrounds. Additionally, we introduce delicate designs that enable location control and flexible background control through textual descriptions or user-defined backgrounds. Our method allows for object customization without the need for test-time optimization, providing simultaneous control over viewpoints, location, and text. Experimental results show that our method outperforms other customization methods in identity preservation, diversity, and harmony.
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: (1) We propose CustomNet, a unified framework for object customization that explicitly incorporates 3D novel view synthesis capabilities. CustomNet ensures superior preservation of the object's identity, allowing for simultaneous customization of the object's viewpoint and location, text, and background image, without test-time optimization. (2) We develop a novel dataset construction pipeline that effectively leverages synthetic multi-view data and massive natural images to customize real-world objects and complex backgrounds more harmoniously. (3) Experimental results demonstrate that the proposed CustomNet outperforms existing customization methods in identity preservation, diversity, and harmony of the customized results.
Supplementary Material: zip
Submission Number: 3741