TextAtlas5M: A Large-Scale Dataset for Long and Structured Text Image Generation

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: long-form text, text rendering, text-conditioned image generation
TL;DR: We introduce TextAtlas5M, a large-scale dataset and benchmark for long-text image generation that reveals significant performance gaps and improves model training.
Abstract: Text-conditioned image generation has gained significant attention in recent years, and models now process increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of text and visuals is essential for conveying complex information. Despite these advances, however, rendering images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering, where "long text" refers not only to textual length but also to layout complexity and semantic richness. In our context, long text involves dense visual content, hierarchical structures, and interleaved text-image layouts, as exemplified by subsets such as TextVisionBlend, PPT2Structured, CoverBook, and TextScenesHQ. Our dataset consists of 5 million generated and collected images spanning diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate 4,000 human-improved test cases (TextAtlasEval) across 4 domains, establishing one of the most extensive benchmarks for text rendering. Evaluations show that TextAtlasEval presents significant challenges even for the most advanced proprietary models (e.g., GPT-4o), while open-source counterparts exhibit an even larger performance gap. Notably, diffusion and autoregressive models with weak text rendering improve substantially after training on our dataset. These findings position TextAtlas5M as a valuable resource for training and evaluating next-generation text-conditioned image generation models.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 7029