T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation

ACL ARR 2026 January Submission3639 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · Readers: Everyone · License: CC BY 4.0

Keywords: text-to-image generation benchmark, reasoning-informed text-to-image generation, multimodality and language grounding to vision

Abstract: Text-to-image (T2I) generative models have made remarkable progress in synthesizing high-quality images from textual prompts. Existing research and benchmarks have extensively evaluated how well T2I models follow the literal meaning of a prompt, but their ability to reason over a prompt with domain knowledge, uncovering implicit meaning and contextual nuance, remains underexplored. To bridge this gap, we introduce T2I-ReasonBench, a benchmark designed to probe the knowledge-driven reasoning capabilities of T2I models. T2I-ReasonBench comprises 800 carefully designed prompts organized into four dimensions: (1) Idiom Interpretation, (2) Textual Image Design, (3) Entity Reasoning, and (4) Scientific Reasoning. These dimensions challenge models to integrate domain knowledge, infer implicit meaning, and resolve contextual ambiguities. To quantify performance, we introduce a two-stage evaluation framework: first, a large language model (LLM) generates prompt-specific question-criterion pairs that assess whether the image includes the essential elements resulting from correct reasoning; second, a multimodal LLM (MLLM) scores the generated image against these criteria. Experiments across 16 state-of-the-art T2I and unified multimodal models reveal critical limitations in reasoning-informed generation. Our analysis indicates that the bottleneck of current models lies in reasoning rather than in image generation. This finding highlights the need to strengthen reasoning abilities in the next generation of T2I and unified multimodal systems.
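
The two-stage evaluation described in the abstract can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the `QuestionCriterion` type, the `evaluate_image` function, the stub callables standing in for the LLM and MLLM, the [0, 1] score range, and the plain averaging of per-question scores. The abstract does not specify the scoring scale or the aggregation rule, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuestionCriterion:
    """One prompt-specific check: a question and the criterion for a full score."""
    question: str
    criterion: str


# Hypothetical interfaces (not from the paper):
#   stage 1 maps a prompt to a list of checks (an LLM call in practice);
#   stage 2 maps (image path, check) to a score (an MLLM call in practice).
GenerateChecks = Callable[[str], List[QuestionCriterion]]
ScoreCheck = Callable[[str, QuestionCriterion], float]


def evaluate_image(
    prompt: str,
    image_path: str,
    generate_checks: GenerateChecks,
    score_check: ScoreCheck,
) -> float:
    """Run both stages and return the mean per-check score.

    Averaging and clamping to [0, 1] are assumptions of this sketch.
    """
    checks = generate_checks(prompt)
    if not checks:
        raise ValueError("stage 1 produced no question-criterion pairs")
    scores = [min(1.0, max(0.0, score_check(image_path, c))) for c in checks]
    return sum(scores) / len(scores)


def stub_generate_checks(prompt: str) -> List[QuestionCriterion]:
    """Stand-in for the LLM: returns two fixed checks for an idiom prompt."""
    return [
        QuestionCriterion(
            question="Does the image depict heavy rain?",
            criterion="Full score if a downpour is clearly visible.",
        ),
        QuestionCriterion(
            question="Does the image avoid literal cats and dogs falling?",
            criterion="Full score if no animals fall from the sky.",
        ),
    ]


def stub_score_check(image_path: str, check: QuestionCriterion) -> float:
    """Stand-in for the MLLM: returns fixed scores instead of viewing the image."""
    return 1.0 if "heavy rain" in check.question else 0.5


score = evaluate_image(
    "It is raining cats and dogs.",
    "example.png",  # hypothetical path; the stub never opens it
    stub_generate_checks,
    stub_score_check,
)
```

Passing the two stages in as callables keeps the aggregation logic independent of any particular LLM or MLLM client, which would be swapped in for the stubs in a real run.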

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: text-to-image generation, reasoning ability, benchmark

Contribution Types: Model analysis & interpretability, Data resources, Data analysis

Languages Studied: English

Submission Number: 3639
