Keywords: text-to-image generation benchmark, reasoning-informed text-to-image generation, multimodality and language grounding to vision
Abstract: Text-to-image (T2I) generative models have achieved remarkable progress, demonstrating exceptional capability in synthesizing high-quality images from textual prompts. While existing research and benchmarks have extensively evaluated the ability of T2I models to follow the literal meaning of prompts, their ability to reason over prompts with domain knowledge to uncover implicit meaning and contextual nuances remains underexplored. To bridge this gap, we introduce T2I-ReasonBench, a novel benchmark designed to explore the knowledge-driven reasoning capabilities of T2I models.
T2I-ReasonBench comprises 800 meticulously designed prompts organized into four dimensions: \textbf{(1) Idiom Interpretation}, \textbf{(2) Textual Image Design}, \textbf{(3) Entity Reasoning}, and \textbf{(4) Scientific Reasoning}. These dimensions challenge models to integrate domain knowledge, infer implicit meaning, and resolve contextual ambiguities. To quantify performance, we introduce a two-stage evaluation framework: first, a large language model (LLM) generates prompt-specific question-criterion pairs that assess whether the image includes the essential elements resulting from correct reasoning; then, a multimodal LLM (MLLM) scores the generated image against these criteria.
Experiments across 16 state-of-the-art T2I and unified multimodal models reveal critical limitations in reasoning-informed generation.
Our comprehensive analysis indicates that the bottleneck of current models lies in reasoning rather than generation.
This finding highlights the critical need to enhance reasoning abilities in the next generation of T2I and unified multimodal systems.
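The two-stage evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the names `QuestionCriterion`, `evaluate_image`, `generate_criteria`, and `judge` are hypothetical, the two model calls are represented by plain callables, and the assumptions that each criterion is scored in [0, 1] and that scores are averaged are ours, since the abstract does not specify the scale or the aggregation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class QuestionCriterion:
    """One prompt-specific check (hypothetical structure)."""
    question: str   # what to look for in the image
    criterion: str  # how to judge whether it is satisfied


# Hypothetical stand-ins for the LLM (stage 1) and the MLLM (stage 2).
CriteriaGenerator = Callable[[str], List[QuestionCriterion]]
ImageJudge = Callable[[bytes, QuestionCriterion], float]


def evaluate_image(
    prompt: str,
    image: bytes,
    generate_criteria: CriteriaGenerator,
    judge: ImageJudge,
) -> float:
    """Score one generated image against criteria derived from its prompt."""
    # Stage 1: the LLM turns the prompt into question-criterion pairs.
    pairs = generate_criteria(prompt)
    if not pairs:
        raise ValueError("no question-criterion pairs were generated")

    # Stage 2: the MLLM scores the image on each pair.
    # Clamping to [0, 1] and averaging are assumptions of this sketch.
    scores = [min(1.0, max(0.0, judge(image, pair))) for pair in pairs]
    return mean(scores)
```

In practice, `generate_criteria` and `judge` would wrap calls to whichever LLM and MLLM are used; keeping them as injected callables lets the scoring logic be tested with simple stubs.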
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: text-to-image generation, reasoning ability, benchmark
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3639