Abstract: Text-to-image (T2I) models are often touted for their ability to compose images from many components, yet they can fail to generate all entities even when prompted with just two or three. In this work, we seek to explain such failures in terms of the training data. We introduce the training appearance ratio, which compares the number of training images depicting specific entities with the number of training captions mentioning those same entities, and we examine how well this measure predicts whether a prompt composed of several entities yields a successful generation (i.e., an image depicting all specified entities). We find positive and significant correlations between these ratios and generation success. Furthermore, our proposed measure correlates more strongly with model success rates than existing training-data frequency measures, suggesting that the training appearance ratio better captures the relationship between training data and generation success.
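For illustration, the sketch below shows one plausible way to compute a per-entity training appearance ratio as described in the abstract. The exact definition, normalization, and how "depicting" is determined (e.g., detector outputs vs. human annotations) are not specified here, so the function name, inputs, and the images-over-captions form of the ratio are assumptions, not the paper's actual implementation.

```python
from typing import Iterable, Set


def training_appearance_ratio(
    entity: str,
    image_entity_labels: Iterable[Set[str]],  # per-image sets of entities judged to appear in the image
    captions: Iterable[str],                  # the corresponding training captions
) -> float:
    """Hypothetical sketch: (#training images depicting `entity`) /
    (#training captions mentioning `entity`).

    Assumes a simple count ratio with case-insensitive substring matching
    for caption mentions; the paper may define and normalize this differently.
    """
    num_depicting = sum(1 for labels in image_entity_labels if entity in labels)
    num_mentioning = sum(1 for cap in captions if entity.lower() in cap.lower())
    if num_mentioning == 0:
        return 0.0  # avoid division by zero when the entity is never mentioned
    return num_depicting / num_mentioning


# Toy usage (illustrative data only):
labels = [{"dog", "ball"}, {"cat"}, {"dog"}]
caps = ["a dog chasing a ball", "a sleeping cat", "a park scene"]
print(training_appearance_ratio("dog", labels, caps))  # 2 depictions / 1 mention = 2.0
```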
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Multimodal Models, Text-to-Image Models, Model Analysis, Training Data Analysis
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3488