# LLMScore
We propose LLMScore, a new framework that produces evaluation scores with multi-granularity compositionality. LLMScore leverages large language models (LLMs) to evaluate text-to-image models.

## Overview
The two images were generated with Stable-Diffusion-2 from a text prompt sampled from the Concept Conjunction dataset. The *Baseline* section shows scores from existing model-based evaluation metrics, the *Human* section shows the rating from human evaluation, and the *LLMScore* section shows our proposed metric. The right column also shows the rationale generated by LLMScore.
<p align="center">
<img src="assets/teaser.png" width="1024px"></img>
</p>
Comparison of Text-Image Matching, Sentence Matching, and our LLM-based Instruction-Following Matching pipeline for text-to-image synthesis evaluation. LLMScore automatically provides accurate scores and reasonable rationales for text-to-image synthesis, based on the text prompt and visual descriptions, and following various evaluation instructions.
<p align="center">
<img src="assets/matching_pipeline.png" width="1024px"></img>
</p>



## Installation
Please follow the [install](INSTALL.md) page to set up the environments and models.

## Text-to-Image Synthesis Evaluation
Get a score and a rationale for the alignment between an image and its text prompt:
```
python llm_score.py --image sample/sample.png --text_prompt "a red car and a white sheep"
```
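To score several image/prompt pairs in one go, the command above can be wrapped in a small driver. The following is a minimal sketch, not part of the repository: it only reuses the `--image` and `--text_prompt` flags shown above, the helper names `build_command` and `score_pairs` are hypothetical, and it makes no assumption about what `llm_score.py` prints.

```python
import subprocess
import sys


def build_command(image_path, text_prompt, script="llm_score.py"):
    """Build the argv list for one llm_score.py invocation.

    Only the two flags documented in this README are used.
    """
    return [sys.executable, script, "--image", image_path,
            "--text_prompt", text_prompt]


def score_pairs(pairs, run=False):
    """Build (and optionally run) one command per (image, prompt) pair.

    With run=False the commands are only returned, which is handy for
    checking them before launching a long evaluation.
    """
    commands = [build_command(image, prompt) for image, prompt in pairs]
    if run:
        for command in commands:
            # check=True raises CalledProcessError if the script fails.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # The pair below matches the sample command in this README.
    pairs = [("sample/sample.png", "a red car and a white sheep")]
    for command in score_pairs(pairs, run=False):
        print(" ".join(command))
```

Passing `run=True` executes each command in sequence from the repository root; the scores and rationales appear in whatever form the script itself emits.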

## LLMScore with Rationale

<p align="center">
<img src="assets/showcase.png" width="1024px"></img>
</p>

## Human Correlation

The rank correlation with human ratings (Kendall's tau) is aggregated over the compositional prompt datasets (Concept Conjunction, Attribute Binding Contrast) in the left two columns (CompBench) and over the general prompt datasets (MSCOCO, DrawBench, PaintSkills) in the right two columns (GeneralBench).
<p align="center">
<img src="assets/vis_kendall.png" width="1024px"></img>
</p>
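For readers who want to run a similar comparison on their own ratings, the sketch below computes Kendall's tau between human ratings and metric scores. It assumes the tau-b variant, which adjusts for ties; this README does not state which variant was used, and the function name and sample values are illustrative rather than taken from the repository.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists.

    Counts concordant and discordant pairs and corrects the
    denominator for pairs tied in only one of the two lists.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    # The correlation is undefined when one list is constant.
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5]               # illustrative human ratings
    metric = [0.1, 0.4, 0.3, 0.8, 0.9]    # illustrative metric scores
    print(kendall_tau_b(human, metric))
```

In the illustrative data, one of the ten pairs is ranked in the opposite order by the metric, so the correlation falls below 1. The O(n²) pair loop is fine for a few thousand samples; for larger sets, `scipy.stats.kendalltau` is a faster alternative.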

