Long-Text-to-Image Generation via Compositional Prompt Decomposition

ICLR 2026 Conference Submission 11

01 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Compositionality; Text-to-Image Generation; Generative Model Generalization
TL;DR: We decompose long prompts so that pre-trained text-to-image models can handle long inputs, demonstrating superior generalization as prompt length increases.
Abstract: While modern text-to-image models excel at generating images from intricate prompts, they struggle to capture key details when prompts are expanded into descriptive paragraphs. This limitation stems from the prevalence of short captions in their training data. Existing methods attempt to address this either by fine-tuning the pre-trained models, which generalizes poorly to even longer inputs, or by projecting the oversized inputs into the short-prompt domain, which compromises fidelity. We propose a compositional approach that enables pre-trained models to handle long prompts by breaking them down into manageable components. Specifically, we introduce a trainable PromptDecomposer module that decomposes a long prompt into a set of distinct sub-prompts. The pre-trained T2I model processes these sub-prompts in parallel, and their corresponding outputs are merged using concept conjunction. Our compositional long-text-to-image model achieves performance comparable to that of models with specialized tuning. Meanwhile, our approach demonstrates superior generalization, outperforming other models by 7.4% on prompts over 500 tokens in the challenging DetailMaster benchmark.
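For readers unfamiliar with concept conjunction, the merge step described in the abstract is typically realized as in composable diffusion models: each sub-prompt's noise prediction contributes a classifier-free-guidance offset around a shared unconditional prediction, and the offsets are summed. The sketch below is a minimal illustration under that assumption; eps_model, the embedding arguments, and the guidance weight are hypothetical placeholders, not the submission's actual interface.

    from typing import Callable, Sequence

    import torch


    def conjunction_step(
        eps_model: Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor],
        x_t: torch.Tensor,
        t: torch.Tensor,
        sub_prompt_embs: Sequence[torch.Tensor],
        uncond_emb: torch.Tensor,
        weight: float = 7.5,
    ) -> torch.Tensor:
        """Merge several sub-prompts in one denoising step via concept conjunction.

        Each sub-prompt contributes its classifier-free-guidance offset relative
        to a shared unconditional prediction; the offsets are summed, following
        composable diffusion. All names here are illustrative placeholders.
        """
        eps_uncond = eps_model(x_t, t, uncond_emb)
        # One forward pass of the frozen T2I backbone per sub-prompt
        # (these calls are independent and can be batched or run in parallel).
        offsets = [eps_model(x_t, t, emb) - eps_uncond for emb in sub_prompt_embs]
        # Concept conjunction: unconditional prediction plus the weighted
        # sum of all conditional offsets.
        return eps_uncond + weight * sum(offsets)

In this sketch the decomposed sub-prompt embeddings would come from a module like the paper's PromptDecomposer, while the backbone itself stays frozen; only the merge rule above combines their outputs.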
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11