Stylized Text-to-Image Generation (STIG) aims to generate images based on text prompts and style reference images. In this paper, we propose a novel framework dubbed StyleMaster for this task by leveraging pretrained Stable Diffusion (SD), addressing problems of previous methods such as misinterpreted style and inconsistent semantics. The enhancement lies in two novel modules: a multi-source style embedder and a dynamic attention adapter. To provide SD with better style embeddings, the multi-source style embedder considers both global- and local-level visual information along with textual information, thereby offering complementary style-related and semantic-related knowledge. Additionally, aiming for a better balance between adapter capacity and semantic control, the proposed dynamic attention adapter is applied to the diffusion UNet, with adaptation weights dynamically calculated from the style embeddings. Two objective functions are introduced alongside the denoising loss to optimize the model, further enhancing semantic and style consistency. Extensive experiments demonstrate the superiority of StyleMaster over existing methods, rendering images in diverse target styles while faithfully preserving the semantic content of the text prompts.
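
To make the dynamic attention adapter idea concrete, below is a minimal PyTorch sketch of one plausible realization: a small hypernetwork predicts a per-sample low-rank weight update for an attention projection from the style embedding, so the adaptation weights are "dynamically calculated based on the style embeddings" as the abstract describes. All names, dimensions, and the low-rank factorization here are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DynamicAttentionAdapter(nn.Module):
    """Hypothetical sketch of a style-conditioned attention adapter.

    A hypernetwork maps the style embedding to the two factors of a
    low-rank residual weight update, which is then applied to the
    attention hidden states. Dimensions are assumed for illustration.
    """

    def __init__(self, style_dim=768, hidden_dim=320, rank=4):
        super().__init__()
        self.rank = rank
        # Predict the down- and up-projection factors from the style embedding.
        self.to_down = nn.Linear(style_dim, hidden_dim * rank)
        self.to_up = nn.Linear(style_dim, rank * hidden_dim)

    def forward(self, hidden_states, style_emb):
        # hidden_states: (batch, tokens, hidden_dim)
        # style_emb:     (batch, style_dim)
        b, _, d = hidden_states.shape
        down = self.to_down(style_emb).view(b, d, self.rank)  # (b, d, r)
        up = self.to_up(style_emb).view(b, self.rank, d)      # (b, r, d)
        # Residual low-rank update whose weights depend on the style input.
        return hidden_states + hidden_states @ down @ up


# Usage: adapt cross-attention features in a UNet block with a style embedding.
adapter = DynamicAttentionAdapter()
feats = torch.randn(2, 77, 320)   # e.g. attention hidden states
style = torch.randn(2, 768)       # output of the (assumed) style embedder
out = adapter(feats, style)
print(out.shape)                  # torch.Size([2, 77, 320])
```

Because the update is low-rank and generated on the fly, such a design keeps the added adapter capacity small while still letting every style embedding induce its own attention behavior, which matches the capacity-versus-control trade-off the abstract highlights.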