TEXTTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation

27 Oct 2025 (modified: 31 Oct 2025) · Submitted to IJCNLP-AACL 2025 SRW (ARR Commitment) · License: CC BY 4.0
Keywords: Cross-modal content generation, Multimodality, Text-to-image generation, Diffusion model
Submission Category: Long Paper
TLDR: Our proposed method enhances image generation by expanding entity knowledge in prompts and summarizing it using LLMs.
Abstract: Generating images from prompts containing specific entities requires models to retain as much entity-specific knowledge as possible. However, fully memorizing such knowledge is impractical due to the vast number of entities and their continuous emergence. To address this, we propose Text-based Intelligent Generation with Entity prompt Refinement (TextTIGER), which augments knowledge about the entities included in a prompt and then summarizes the augmented descriptions using Large Language Models (LLMs) to mitigate the performance degradation caused by longer inputs. To evaluate our method, we introduce WiT-Cub (WiT with Captions and Uncomplicated Background-explanations), a dataset comprising captions, images, and an entity list. Experiments on multiple image generation models and LLMs show that TextTIGER improves image generation performance on standard metrics (IS, FID, and CLIPScore) compared to caption-only prompts. Additionally, an evaluation by multiple annotators confirms that the summarized descriptions are more informative, validating LLMs' ability to generate concise yet rich descriptions. These findings demonstrate that refining prompts with augmented and summarized entity-related descriptions significantly enhances image generation capabilities. The dataset will be available upon acceptance.
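The two-stage refinement described in the abstract (entity knowledge augmentation followed by LLM summarization) can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the toy knowledge lookup and the word-truncating `summarize` stand in for real knowledge retrieval and a real LLM summarization call, and all names here are hypothetical.

```python
# Hypothetical sketch of the TextTIGER-style prompt-refinement pipeline.
# The entity "knowledge base" and the summarizer below are illustrative
# stand-ins; a real system would retrieve entity descriptions (e.g. from
# Wikipedia) and prompt an LLM for a concise, information-rich summary.

ENTITY_KNOWLEDGE = {
    "Sagrada Familia": (
        "A large unfinished basilica in Barcelona designed by Antoni Gaudi, "
        "known for its intricate spires."
    ),
}

def augment(caption: str, entities: list[str]) -> str:
    """Append available entity descriptions to the caption (augmentation step)."""
    notes = [ENTITY_KNOWLEDGE[e] for e in entities if e in ENTITY_KNOWLEDGE]
    return (caption + " " + " ".join(notes)) if notes else caption

def summarize(text: str, max_words: int = 30) -> str:
    """Placeholder for LLM summarization: naively truncate to max_words."""
    return " ".join(text.split()[:max_words])

def refine_prompt(caption: str, entities: list[str]) -> str:
    """Augment the caption with entity knowledge, then compress it."""
    return summarize(augment(caption, entities))

refined = refine_prompt("A photo of the Sagrada Familia at sunset.",
                        ["Sagrada Familia"])
print(refined)
```

The refined prompt now carries entity-specific detail (architect, location, visual traits) while staying short enough to feed a text-to-image model without the degradation longer inputs can cause.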
Student Status Proof: pdf
Paper Link: https://openreview.net/forum?id=puq6Fqt4Sv
Submission Number: 2