Enhancing Fine-Tuning Performance of Large-Scale Text-to-Image Models on Specialized Datasets

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: fine-tuning pre-trained models, stable-diffusion, diffusion models, contrastive learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Fine-tuning pre-trained large-scale text-to-image models on specialized datasets has gained popularity for downstream image generation tasks. However, directly fine-tuning Stable-Diffusion on such datasets often falls short of yielding satisfactory outcomes. To delve into the underlying reasons, we introduce a novel perspective for investigating the intrinsic factors that impact fine-tuning outcomes. We identify that the limitations of fine-tuning stem from an inability to effectively improve text-image alignment and to reduce text-image alignment drift. To tackle this issue, we leverage the powerful ability of contrastive learning to optimize feature distributions. By explicitly refining text feature representations during generation, we aim to enhance text-image alignment and minimize alignment drift, thereby improving fine-tuning performance on specialized datasets. Our approach is versatile, resource-efficient, and integrates seamlessly with existing controllable generation methods. Experimental results demonstrate that our method significantly enhances fine-tuning performance.
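The abstract names contrastive learning as the tool for refining text feature representations but gives no formulation. A common instantiation of contrastive text-image alignment is a symmetric InfoNCE loss over a batch of paired features, where matched text/image pairs are pulled together and all other in-batch pairs act as negatives. The sketch below is an illustrative assumption, not the paper's actual objective; the function name `info_nce_loss` and the temperature value are hypothetical.

```python
import numpy as np

def info_nce_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image features.

    Row i of `text_feats` and row i of `image_feats` form a positive pair;
    every other row in the batch serves as a negative. (Illustrative sketch,
    not the submission's actual training objective.)
    """
    # L2-normalize so dot products become cosine similarities
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = t @ v.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))          # diagonal entries are positives

    def xent(l):
        # numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the text->image and image->text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing such a loss on the fine-tuning set would tighten the text-image alignment the abstract describes: well-aligned pairs yield a loss near zero, while misaligned pairs are penalized.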
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5262