Keywords: Text-to-Image, Visual Text Generation, Visual Text Editing, Customizable Attributes
TL;DR: AnyText2 introduces a novel approach for precise control over text attributes in natural scene images, achieving faster processing and improved text accuracy compared to its predecessor, while enabling customizable fonts and colors.
Abstract: With the ongoing development of the text-to-image (T2I) domain, accurately generating text within images so that it integrates seamlessly with the visual content has garnered increasing interest from the research community. In addition to controlling the glyphs and positions of text, there is a rising demand for more fine-grained control over text attributes, such as font style and color, while maintaining the realism of the generated images. However, this issue has not yet been sufficiently explored. In this paper, we present AnyText2, the first known method to achieve precise control over the attributes of every line of multilingual text when generating images of natural scenes. Our method comprises two main components. First, we introduce an efficient WriteNet+AttnX architecture that encodes text features and injects these intermediate features into the U-Net decoder via learnable attention layers. This design is 19.8% faster than its predecessor, AnyText, and improves the realism of the generated images. Second, we thoroughly explore methods for extracting text fonts and colors from real images, and then develop a Text Embedding Module that employs multiple encoders to separately encode the glyph, position, font, and color of the text. This enables customizable fonts and colors for each text line, yielding a 3.3% and 9.3% increase in text accuracy for Chinese and English, respectively, compared to AnyText. Furthermore, we validate the use of long captions, which enhances prompt-following and image realism without sacrificing text writing accuracy. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method. The code and model will be open-sourced in the future to promote the development of text generation technology.
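The abstract describes a Text Embedding Module that encodes the glyph, position, font, and color of each text line with separate encoders. The following is a minimal PyTorch sketch of that idea only; every module name, encoder design, dimension, and the fusion scheme here is an illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TextEmbeddingModule(nn.Module):
    """Sketch: separate encoders for glyph, position, font, and color of one
    text line, fused into a single conditioning token. All architectural
    choices below are assumptions for illustration."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Hypothetical per-attribute encoders (small CNNs / linear layers).
        self.glyph_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )
        self.position_encoder = nn.Linear(4, embed_dim)   # box: (x, y, w, h)
        self.font_encoder = nn.Sequential(                # rendered font patch
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        self.color_encoder = nn.Linear(3, embed_dim)      # RGB color
        self.fuse = nn.Linear(4 * embed_dim, embed_dim)

    def forward(self, glyph_img, box, font_patch, rgb):
        # Each input describes one text line; the output is a token intended
        # to be injected into the diffusion model's conditioning sequence.
        feats = torch.cat([
            self.glyph_encoder(glyph_img),
            self.position_encoder(box),
            self.font_encoder(font_patch),
            self.color_encoder(rgb),
        ], dim=-1)
        return self.fuse(feats)


# Usage with dummy inputs for a single text line.
module = TextEmbeddingModule()
token = module(
    glyph_img=torch.randn(1, 1, 64, 256),        # rendered glyph image
    box=torch.tensor([[0.1, 0.2, 0.5, 0.1]]),    # normalized text box
    font_patch=torch.randn(1, 1, 64, 64),        # sample of the target font
    rgb=torch.tensor([[1.0, 0.0, 0.0]]),         # desired text color
)
print(token.shape)  # torch.Size([1, 768])
```

Separating the attribute encoders keeps each control signal (glyph shape, placement, font style, color) independently editable at inference time, which is the property the paper claims for per-line customization.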
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11040