Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Diffusion-based text-to-image personalization has achieved great success in generating user-specified subjects in various contexts. However, finetuning-based methods often suffer from model overfitting, leading to reduced generative diversity, particularly when the provided subject images are limited. To address this issue, we introduce Pick-and-Draw, a training-free semantic guidance approach that enhances identity consistency and generative diversity. Our method comprises two key components: appearance-picking guidance and layout-drawing guidance. In the appearance-picking phase, we create an appearance palette from visual features of the reference image, selecting local patterns to maintain consistent subject identity. In the layout-drawing phase, we use a generative template from the base diffusion model to sketch the subject shape and scene outline, leveraging its strong image prior to produce diverse contexts based on various text prompts. Pick-and-Draw can be seamlessly integrated with any personalized diffusion model and requires only a single reference image. Both qualitative and quantitative evaluations demonstrate that our approach significantly improves identity consistency and generative diversity, establishing a new Pareto frontier in the balance between subject fidelity and image-text alignment.
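For concreteness, below is a minimal sketch of the two guidance phases as described in the abstract. This is not the authors' implementation: the feature extractor, patch size, guidance strength, and every function name here (`unet_features`, `denoise_step`, `pick_appearance`) are simplified stand-ins assumed purely for illustration; a real implementation would hook intermediate features of a frozen diffusion UNet.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, T, GUIDANCE_SCALE = 64, 50, 5.0  # toy feature dim / steps / strength (assumed)

# Toy stand-in for intermediate UNet features: project 8x8 image patches
# to D-dim "local features".
W = torch.randn(3 * 8 * 8, D)

def unet_features(x):
    patches = F.unfold(x, kernel_size=8, stride=8)              # (B, 192, 64)
    return patches.transpose(1, 2).reshape(-1, 3 * 8 * 8) @ W  # (B*64, D)

def denoise_step(x, t):
    # Placeholder for one reverse step of the frozen base diffusion model.
    return x - 0.01 * torch.randn_like(x)

# --- Appearance picking: build a palette of local reference features. ---
reference = torch.randn(1, 3, 64, 64)                    # single reference image
palette = F.normalize(unet_features(reference), dim=-1)  # (M, D)

def pick_appearance(feats):
    # Swap each generated feature for its nearest palette entry so local
    # patterns stay consistent with the subject identity.
    sims = F.normalize(feats, dim=-1) @ palette.T        # (N, M)
    return palette[sims.argmax(dim=-1)]

# --- Layout drawing: run the unmodified base model in parallel as a
# generative template sketching the subject shape and scene outline. ---
x_template = torch.randn(1, 3, 64, 64)
x_guided = x_template.clone()

for t in range(T):
    x_template = denoise_step(x_template, t)             # diverse layout prior
    x = x_guided.detach().requires_grad_(True)
    feats = unet_features(x)
    loss = F.mse_loss(feats, pick_appearance(feats).detach())
    (grad,) = torch.autograd.grad(loss, x)
    # Nudge the guided sample toward the picked reference appearance; the
    # full method additionally aligns its layout with x_template.
    x_guided = denoise_step(x - GUIDANCE_SCALE * grad, t)
```

Since the guidance is applied only at sampling time, `denoise_step` could wrap any personalized diffusion model without retraining, which is consistent with the plug-and-play claim in the abstract.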
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications
Relevance To Conference: Text-to-image personalization is a central task in generative multimedia: the goal is to generate subjects that preserve the identity of reference images provided by users. In this work, we address the model overfitting and limited generative diversity of finetuning-based text-to-image personalization methods with a training-free semantic guidance approach.
Supplementary Material: zip
Submission Number: 1413