Keywords: T2I models, Benchmark
Abstract: The rapid advancement of Text-to-Image (T2I) models has ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I evaluation benchmarks fall short, offering limited prompt diversity and complexity as well as coarse evaluation metrics, which makes it difficult to assess the fine-grained alignment between textual instructions and generated images. In this paper, we present **TIIF-Bench** (**T**ext-to-**I**mage **I**nstruction **F**ollowing **Bench**mark), which aims to systematically assess T2I models' ability to interpret and follow intricate textual instructions. TIIF-Bench comprises 5000 prompts organized along multiple dimensions and categorized into three levels of difficulty and complexity. To rigorously evaluate model robustness to varying prompt lengths, we provide a short and a long version of each prompt with identical core semantics. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models. In addition, we collect 100 high-quality designer-level prompts spanning diverse scenarios to comprehensively assess model performance. Leveraging the world knowledge encoded in large vision-language models, we propose a novel computable framework for discerning subtle variations in T2I model outputs. Through meticulous benchmarking of mainstream T2I models on TIIF-Bench, we analyze their strengths and weaknesses and reveal the limitations of existing T2I benchmarks.
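The abstract does not spell out the evaluation protocol, but a common way to build a "computable framework" on top of a vision-language model is to decompose each prompt into attribute questions and let the VLM act as a judge. The sketch below is an assumption about how such scoring could look, not the paper's actual method; `query_vlm` and the example checklist are hypothetical placeholders.

```python
# Minimal sketch (assumption, not the paper's protocol): a VLM-as-judge loop
# that scores instruction following by asking a vision-language model a
# checklist of yes/no questions derived from each prompt's attributes.

from typing import Dict, List


def query_vlm(image_path: str, question: str) -> str:
    """Hypothetical wrapper around a vision-language model backend.

    Returns the model's free-form answer to a question about the image.
    """
    raise NotImplementedError("plug in your VLM backend here")


def score_generation(image_path: str, checklist: List[str]) -> float:
    """Fraction of attribute questions the VLM answers 'yes' to."""
    hits = 0
    for question in checklist:
        answer = query_vlm(image_path, f"{question} Answer yes or no.")
        hits += answer.strip().lower().startswith("yes")
    return hits / len(checklist)


# Example: a compositional prompt decomposed into checkable attributes.
example: Dict[str, object] = {
    "prompt": "A red cube on top of a blue sphere, in watercolor style",
    "checklist": [
        "Is there a red cube in the image?",
        "Is there a blue sphere in the image?",
        "Is the cube positioned on top of the sphere?",
        "Is the image rendered in a watercolor style?",
    ],
}
# score = score_generation("output.png", example["checklist"])
```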
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1524