GenAI-Bench: A Holistic Benchmark for Compositional Text-to-Visual Generation

Published: 09 Apr 2024, Last Modified: 23 Apr 2024, SynData4CV, CC BY 4.0
Keywords: Generative Models, Vision-Language Models, Automatic Evaluation, Visio-Linguistic Compositionality
TL;DR: We introduce GenAI-Bench to evaluate compositional text-to-visual generation and automated evaluation metrics.
Abstract: Text-to-visual models can now generate photo-realistic images and videos that accurately depict objects and scenes. Still, they struggle with compositions of attributes, relationships, and higher-order reasoning such as counting, comparison, and logic. To address this, we introduce GenAI-Bench to evaluate compositional text-to-visual generation through 1,600 high-quality prompts collected from professional designers, surpassing the difficulty and diversity of existing benchmarks like PartiPrompts and T2I-CompBench. Our human and automated evaluations on GenAI-Bench reveal that state-of-the-art models like DALL-E 3, Stable Diffusion, and Gen2 often fail to parse user prompts requiring advanced compositional reasoning. Finally, we release over 24,000 human ratings on synthetic images and videos produced by ten leading generative models (with the number still growing) to support the development of automated text-to-visual evaluation metrics.
Submission Number: 48