Diffusion models have shown remarkable performance in generating a broad spectrum of visual content. However, their text rendering ability remains limited: they often produce incorrect characters or words that do not blend well with the background image. To address this, we introduce a novel framework named ARTIST, which adds a textual diffusion model dedicated to learning text structure. We first pretrain the textual diffusion model, and then fine-tune the visual model to learn how to inject textual structure information from the frozen textual model into the generated image. This disentangled architecture design and training strategy significantly enhance the text rendering ability of diffusion models for text-rich image generation. Furthermore, we leverage pretrained large language models to infer the user's intention, which leads to better generation quality. Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% across various metrics.
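The sketch below illustrates the disentangled two-stage training described above in simplified PyTorch. It is a minimal illustration only, assuming a toy denoiser interface and a single feature-injection point; the module and function names (TextualUNet, VisualUNet, features, add_noise) are hypothetical stand-ins and do not reflect the paper's actual architecture or API.

```python
# Minimal sketch of a disentangled two-stage training strategy:
# stage 1 pretrains a textual diffusion model on glyph-only data,
# stage 2 freezes it and fine-tunes a visual model that consumes
# the injected text-structure features. All names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualUNet(nn.Module):
    """Stand-in denoiser that learns text/glyph structure only."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(4, dim, 3, padding=1)
        self.dec = nn.Conv2d(dim, 4, 3, padding=1)
    def forward(self, x_t, t):
        return self.dec(F.silu(self.enc(x_t)))   # predicted noise
    def features(self, x_t, t):
        return self.enc(x_t)                      # text-structure features

class VisualUNet(nn.Module):
    """Stand-in visual denoiser with one textual-feature injection point."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(4, dim, 3, padding=1)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)    # injection point
        self.dec = nn.Conv2d(dim, 4, 3, padding=1)
    def forward(self, x_t, t, text_feats):
        h = torch.cat([self.enc(x_t), text_feats], dim=1)
        return self.dec(F.silu(self.fuse(h)))

def add_noise(x0, t):
    # Toy forward diffusion: x_t = sqrt(1 - t) * x0 + sqrt(t) * eps.
    eps = torch.randn_like(x0)
    a = (1.0 - t).sqrt().view(-1, 1, 1, 1)
    b = t.sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * eps, eps

textual, visual = TextualUNet(), VisualUNet()

# Stage 1: pretrain the textual diffusion model (placeholder glyph data).
opt_t = torch.optim.AdamW(textual.parameters(), lr=1e-4)
for glyph_latents in [torch.randn(2, 4, 32, 32)]:
    t = torch.rand(glyph_latents.size(0))
    x_t, eps = add_noise(glyph_latents, t)
    loss = F.mse_loss(textual(x_t, t), eps)
    opt_t.zero_grad(); loss.backward(); opt_t.step()

# Stage 2: freeze the textual model; fine-tune only the visual model
# to use the injected text-structure features.
textual.requires_grad_(False)
opt_v = torch.optim.AdamW(visual.parameters(), lr=1e-4)
for image_latents, glyph_latents in [(torch.randn(2, 4, 32, 32),
                                      torch.randn(2, 4, 32, 32))]:
    t = torch.rand(image_latents.size(0))
    x_t, eps = add_noise(image_latents, t)
    g_t, _ = add_noise(glyph_latents, t)
    with torch.no_grad():
        feats = textual.features(g_t, t)
    loss = F.mse_loss(visual(x_t, t, feats), eps)
    opt_v.zero_grad(); loss.backward(); opt_v.step()
```

The key point the sketch conveys is the separation of concerns: the textual model's weights stay frozen in stage 2, so text-structure knowledge is preserved while only the visual model learns how to absorb it.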