GRADE: A Fine-grained Approach to Measure Sample Diversity in Text-to-Image Models

NeurIPS 2024 Workshop ATTRIB Submission 62 Authors

Published: 30 Oct 2024, Last Modified: 14 Jan 2025. Venue: ATTRIB 2024. License: CC BY 4.0
Keywords: text-to-image, evaluation, diversity
Abstract: Evaluating the diversity of text-to-image (T2I) model outputs remains a challenge, especially in capturing the fine-grained variations essential for creativity and bias mitigation. Existing diversity metrics such as Fréchet Inception Distance (FID) and Recall require reference images and are generally unreliable. We propose \textbf{Gr}anular \textbf{A}ttribute \textbf{D}iversity \textbf{E}valuation (GRADE), a descriptive, fine-grained method for assessing sample diversity in T2I models that requires no reference images. GRADE estimates the distribution of attributes within generated images of a concept, such as the shape or flavor distribution of the concept ``cookie'', and computes its normalized entropy, yielding both interpretable insights into model behavior and a diversity score. We show that GRADE achieves over 90\% agreement with human evaluation while correlating only weakly with FID and Recall, indicating that it captures new, fine-grained forms of diversity. We use GRADE to measure and compare the diversity of 12 T2I models and find that the most advanced models are the least diverse, scoring just 0.47 entropy and defaulting to depicting concepts with the same attributes (e.g., cookies are round) 88\% of the time, despite varied prompts. We observe an inherent trade-off between diversity and prompt adherence, akin to the Precision-Recall trade-off, as well as a negative correlation between diversity and model size. We identify underspecified captions in training data as a significant contributor to low sample diversity, leading models to depict concepts with the same attributes. GRADE serves as a valuable tool for benchmarking and guiding the development of more diverse T2I models.
Submission Number: 62
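The normalized-entropy score at the core of the abstract can be illustrated with a short sketch. This is not the paper's exact formulation: the function name and the choice to normalize by the number of distinct observed attribute values (rather than a fixed attribute vocabulary) are illustrative assumptions.

```python
import math
from collections import Counter

def normalized_entropy(attribute_values):
    """Shannon entropy of the empirical attribute distribution,
    normalized to [0, 1] by the maximum possible entropy log(k).

    NOTE: normalizing by the number of *observed* distinct values k
    is an assumption for this sketch, not necessarily GRADE's choice.
    """
    counts = Counter(attribute_values)          # e.g. {"round": 88, "square": 12}
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    k = len(counts)
    if k < 2:                                   # a single attribute value => zero diversity
        return 0.0
    return entropy / math.log(k)
```

For example, a model that depicts cookies as round 88% of the time and square 12% of the time (as in the abstract's headline statistic) gets a score of about 0.53, while a uniform spread over attribute values scores 1.0.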