DreamBooth++: Boosting Subject-Driven Generation via Region-Level References Packing

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract: DreamBooth has demonstrated significant potential in subject-driven text-to-image generation, especially in scenarios requiring precise preservation of a subject's appearance. However, it remains inefficient, requiring extensive iterative training to customize concepts from a small set of reference images. To address these issues, we introduce DreamBooth++, a region-level training strategy designed to significantly improve the efficiency and effectiveness of learning specific subjects. In particular, our approach employs a region-level data re-formulation technique that packs a set of reference images into a single sample, significantly reducing computational costs. Moreover, we adapt the convolution and self-attention layers so that their processing is restricted to individual regions; their operational scope (i.e., receptive field) is thus kept within a single subject, preventing the model from generating multiple sub-images within one image. Finally, we design a text-guided prior regularization between our model and the pretrained one to preserve its original semantic generation ability. Comprehensive experiments demonstrate that our training strategy not only accelerates the subject-learning process but also significantly boosts fidelity to both the subject and the prompt in subject-driven generation.
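The mechanisms described in the abstract can be pictured concretely. Below is a minimal PyTorch sketch, not the authors' code: it assumes a 2x2 grid packing, a square token map, and hypothetical function names (`pack_references`, `region_attention_mask`, `region_conv`, `masked_self_attention`); the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def pack_references(images: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Pack grid*grid reference images (grid*grid, C, H, W) into a single
    (C, grid*H, grid*W) sample so one forward pass covers all references."""
    b, c, h, w = images.shape
    assert b == grid * grid, "expects exactly grid*grid reference images"
    rows = [torch.cat(list(images[i * grid:(i + 1) * grid]), dim=-1)
            for i in range(grid)]
    return torch.cat(rows, dim=-2)

def region_attention_mask(h: int, w: int, grid: int = 2) -> torch.Tensor:
    """Boolean (h*w, h*w) mask that is True only where query and key tokens
    fall inside the same grid cell, so self-attention never mixes regions."""
    ys = torch.arange(h).unsqueeze(1).expand(h, w) // (h // grid)
    xs = torch.arange(w).unsqueeze(0).expand(h, w) // (w // grid)
    region_id = (ys * grid + xs).reshape(-1)
    return region_id.unsqueeze(0) == region_id.unsqueeze(1)

def masked_self_attention(q, k, v, mask):
    """q, k, v: (B, heads, h*w, d); mask: (h*w, h*w) boolean, broadcast
    over batch and heads. Cross-region scores are set to -inf."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def region_conv(conv, x: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Apply a conv layer per grid cell so its receptive field never
    crosses a region boundary, then reassemble the packed layout."""
    b, c, h, w = x.shape
    cells = x.reshape(b, c, grid, h // grid, grid, w // grid)
    cells = cells.permute(0, 2, 4, 1, 3, 5).reshape(
        b * grid * grid, c, h // grid, w // grid)
    out = conv(cells)
    oc, oh, ow = out.shape[1:]
    out = out.reshape(b, grid, grid, oc, oh, ow).permute(0, 3, 1, 4, 2, 5)
    return out.reshape(b, oc, grid * oh, grid * ow)
```

Likewise, one plausible reading of the text-guided prior regularization is an MSE penalty tying the fine-tuned model's prediction on generic (non-subject) prompts to a frozen copy of the pretrained model. This assumes a diffusers-style UNet interface (`unet(latents, timesteps, text_emb).sample`); the loss weight and distance metric are assumptions:

```python
def prior_regularization(unet, frozen_unet, noisy_latents, timesteps,
                         text_emb, weight: float = 1.0) -> torch.Tensor:
    """Pull the fine-tuned UNet toward the frozen pretrained UNet on
    generic prompts, preserving the original semantic generation ability."""
    pred = unet(noisy_latents, timesteps, text_emb).sample
    with torch.no_grad():
        prior = frozen_unet(noisy_latents, timesteps, text_emb).sample
    return weight * F.mse_loss(pred, prior)
```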
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work contributes directly to multimedia/multimodal processing by introducing a region-level, subject-centric training strategy for subject-driven generation. This strategy significantly improves the model's ability to accurately preserve specific subjects, which is crucial for multimedia applications that demand high-fidelity visual content. The method not only accelerates the subject-learning process but also boosts overall fidelity, marking a substantial advance in generating detailed and accurate multimedia content from textual descriptions.
Supplementary Material: zip
Submission Number: 1697