T2I-Scorer: Quantitative Evaluation on Text-to-Image Generation via Fine-Tuned Large Multi-Modal Models

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Text-to-image (T2I) generation is a pivotal and core interest within the realm of AI content generation. Amid the swift advancement of both open-source (e.g., Stable Diffusion) and proprietary (e.g., DALL·E, Midjourney) T2I models, there is a notable absence of a comprehensive and robust quantitative framework for evaluating their output quality. Traditional quality assessment methods overlook the textual prompt when judging images; meanwhile, the advent of large multi-modal models (LMMs) introduces the capability to incorporate text prompts into evaluations, yet how to fine-tune these models for precise T2I quality assessment remains unresolved. In this study, we introduce T2I-Scorer, a novel two-stage training methodology for fine-tuning LMMs for T2I evaluation. In the first stage, we collect 397K GPT-4V-labeled question-answer pairs related to T2I evaluation. Termed T2I-ITD, this pseudo-labeled dataset is analyzed and examined by humans, and used for instruction tuning to improve the LMM's low-level quality perception. The first-stage model, T2I-Scorer-IT, achieves higher accuracy on T2I evaluation than all existing T2I metrics under zero-shot settings. In the second stage, we define an explicit multi-task training scheme to further align the LMM with human opinion scores, and the fine-tuned T2I-Scorer reaches state-of-the-art accuracy on both the image quality and image-text alignment perspectives with significant improvements. We anticipate that the proposed metrics can serve as reliable gauges of the ability of T2I generation models in the future. We will make code, data, and weights publicly available.
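To make the described second stage concrete, the sketch below illustrates one plausible form of an explicit multi-task objective: two regression heads on top of an LMM backbone, jointly trained against human opinion scores for image quality and image-text alignment. This is only an assumed illustration; the module and parameter names (DummyLMMBackbone, quality_head, align_head, the loss weights) are hypothetical placeholders and not the paper's released implementation.

```python
# Minimal sketch of a stage-2 multi-task fine-tuning objective.
# Assumption: the actual T2I-Scorer code may differ; names are hypothetical.
import torch
import torch.nn as nn

class DummyLMMBackbone(nn.Module):
    """Stand-in for a large multi-modal model; returns pooled fused features."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_feats, text_feats):
        # A real LMM would fuse image tokens and prompt tokens;
        # here the two modalities are simply averaged as a placeholder.
        return self.proj((image_feats + text_feats) / 2)

class T2IScorerHead(nn.Module):
    """Two regression heads: image quality and image-text alignment."""
    def __init__(self, backbone, dim=768):
        super().__init__()
        self.backbone = backbone
        self.quality_head = nn.Linear(dim, 1)
        self.align_head = nn.Linear(dim, 1)

    def forward(self, image_feats, text_feats):
        h = self.backbone(image_feats, text_feats)
        return self.quality_head(h).squeeze(-1), self.align_head(h).squeeze(-1)

def multitask_loss(pred_q, pred_a, gt_q, gt_a, w_q=1.0, w_a=1.0):
    """Weighted sum of per-task regression losses against human opinion scores."""
    l1 = nn.functional.smooth_l1_loss
    return w_q * l1(pred_q, gt_q) + w_a * l1(pred_a, gt_a)

if __name__ == "__main__":
    model = T2IScorerHead(DummyLMMBackbone())
    img = torch.randn(4, 768)     # pooled image features (batch of 4)
    txt = torch.randn(4, 768)     # pooled prompt features
    gt_quality = torch.rand(4)    # human mean opinion scores, quality
    gt_align = torch.rand(4)      # human mean opinion scores, alignment
    pred_q, pred_a = model(img, txt)
    loss = multitask_loss(pred_q, pred_a, gt_quality, gt_align)
    loss.backward()
    print(f"joint loss: {loss.item():.4f}")
```

Under this assumed setup, sharing one backbone across both heads lets the quality and alignment tasks regularize each other, which is one natural reading of the "explicit multi-task training scheme" described in the abstract.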
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This study leverages the capabilities of Large Multimodal Models (LMMs) to rigorously assess the generation of multimedia content, specifically through text-to-image processes. This approach is particularly significant because both the evaluation mechanism and the subject of the evaluation are intrinsically multimodal. Utilizing LMMs enables a comprehensive analysis of how effectively these models can generate and interpret multimedia content, bridging gaps between text and visual representations. The exploration of LMMs in this context is not only timely but also poised to set a new benchmark in the field of multimedia generation. By examining the synergy between different modalities, this research highlights the potential for more intuitive and effective communication technologies. Moreover, the findings from this study are expected to contribute valuable insights into the advancement of AI-driven multimedia tools, with impact on both academic research and practical applications across industries.
Supplementary Material: zip
Submission Number: 976