# PairComp Benchmark

### Introduction

We propose a new benchmark, <span class="mathvista">PairComp</span>. Each test case in <span class="mathvista">PairComp</span> contains two similar prompts with subtle differences. By comparing the accuracy of the images generated for each prompt, we evaluate whether the model attends to the fine-grained semantic differences between the prompts and produces the correct image for each. The two prompts in a test case differ at the word level, which leads to noticeable distinctions in certain fine-grained semantic aspects. As shown in the following figure, these differences fall into six types: (1) Overall appearance difference; (2) Color difference; (3) Counting difference; (4) Position difference; (5) Style & Tone difference; (6) Text difference.

### Evaluation Protocols
In PairComp, we use InternVL2.5-26B as the evaluation model with the prompt: "Does this image match the description? Please directly respond with yes or no."
We record the probability of the model responding with "yes" (denoted as $P_{yes}$) and with "no" (denoted as $P_{no}$), and compute the semantic consistency score as $S(\mathcal{I}, \mathcal{T}) = P_{yes} / (P_{yes} + P_{no})$.
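
As a minimal sketch of the score defined above, the snippet below normalizes the two probabilities. The function names are illustrative and are not taken from `evaluate_images.py`. The second helper assumes raw logits for the "yes" and "no" tokens are available; in that case the same ratio equals a sigmoid of the logit difference.

```python
import math


def consistency_score(p_yes: float, p_no: float) -> float:
    """Return S = P_yes / (P_yes + P_no) for one image-prompt pair."""
    total = p_yes + p_no
    if total <= 0.0:
        raise ValueError("P_yes + P_no must be positive")
    return p_yes / total


def score_from_logits(logit_yes: float, logit_no: float) -> float:
    """Same score computed from logits (hypothetical helper).

    exp(a) / (exp(a) + exp(b)) equals sigmoid(a - b); the two branches
    keep the exponent non-positive to avoid overflow.
    """
    d = logit_yes - logit_no
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


if __name__ == "__main__":
    print(consistency_score(0.6, 0.2))
    print(score_from_logits(2.0, 2.0))
```

Normalizing by $P_{yes} + P_{no}$ discards any probability mass the evaluator assigns to tokens other than "yes" and "no", so the score always lies in $[0, 1]$.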
For each prompt, we require a text-to-image model to generate two images. Therefore, for a pair of similar prompts $(\mathcal{T}^{1}_i,\mathcal{T}^{2}_i)$, we obtain four generated images $(\mathcal{I}^{1,1}_i,\mathcal{I}^{1,2}_i, \mathcal{I}^{2,1}_i, \mathcal{I}^{2,2}_i)$.
We then compute the semantic consistency scores for each image with respect to its corresponding prompt: $s^{1,1}_i=S(\mathcal{I}^{1,1}_i, \mathcal{T}^{1}_i)$, $s^{1,2}_i=S(\mathcal{I}^{1,2}_i, \mathcal{T}^{1}_i)$, $s^{2,1}_i=S(\mathcal{I}^{2,1}_i, \mathcal{T}^{2}_i)$, $s^{2,2}_i=S(\mathcal{I}^{2,2}_i,\mathcal{T}^{2}_i)$.
The arithmetic mean score is calculated as: $s_a = \frac{1}{4N} \sum_{i=1}^N( s_i^{1,1}+s_i^{1,2}+s_i^{2,1}+s_i^{2,2})$,
and the geometric mean score is calculated as: $s_g = \frac{1}{N} \sum_{i=1}^N \sqrt[4]{ s_i^{1,1}\cdot s_i^{1,2}\cdot s_i^{2,1}\cdot s_i^{2,2}}$.
The "Average" geometric (arithmetic) mean score is obtained by averaging the geometric (arithmetic) mean scores of the six sub-tasks.
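
The two aggregate scores can be sketched as follows. This is an illustration, not the code in `summary_scores.py`: the function name and the input layout (one 4-tuple of scores per prompt pair) are assumptions, and the geometric mean is assumed to be averaged over the $N$ pairs in the same way as the arithmetic mean.

```python
import math
from typing import Sequence, Tuple

Quad = Tuple[float, float, float, float]  # (s11, s12, s21, s22) for one pair


def pair_means(scores: Sequence[Quad]) -> Tuple[float, float]:
    """Return (s_a, s_g) over N prompt pairs."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one prompt pair")
    # Arithmetic mean over all 4N image scores.
    s_a = sum(sum(q) for q in scores) / (4 * n)
    # Per-pair fourth root of the product, averaged over the N pairs.
    s_g = sum(math.prod(q) ** 0.25 for q in scores) / n
    return s_a, s_g


if __name__ == "__main__":
    demo = [(1.0, 1.0, 1.0, 1.0), (0.9, 0.9, 0.1, 0.1)]
    print(pair_means(demo))
```

In the demo, the second pair scores well on one prompt and poorly on the other. Its arithmetic mean is 0.5 while its geometric mean is about 0.3, which illustrates why the geometric mean penalizes a model that handles only one prompt of a pair correctly.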

### Image generation
For each prompt in PairComp, instruct the text-to-image model to generate two images, for a total of 924 × 2 × 2 = 3696 images (924 prompt pairs × 2 prompts × 2 images).
Each image should be named in the format `X_Y_Z.png`, where X is the prompt ID, Y ∈ {0, 1} indicates whether the image corresponds to caption1 (Y=0) or caption2 (Y=1), and Z ∈ {0, 1} indicates whether it is the first (Z=0) or second (Z=1) image generated for that prompt. All images should be placed in the same folder.
The resulting folder layout should be:
```
<IMAGE_FOLDER>/
    0_0_0.png
    0_0_1.png
    0_1_0.png
    0_1_1.png
    1_0_0.png
    1_0_1.png
    1_1_0.png
    1_1_1.png    
    ...
    923_0_0.png
    923_0_1.png
    923_1_0.png
    923_1_1.png   
```
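
Before running the evaluation, it can help to confirm that the folder is complete. The following is a small sketch that builds the expected filenames from the naming rule above and lists any that are missing. The helper names are hypothetical and not part of this repository, and the default of 924 prompt pairs is taken from the image count stated above.

```python
from pathlib import Path
from typing import List


def expected_filenames(num_pairs: int = 924) -> List[str]:
    """All X_Y_Z.png names: X = prompt ID, Y = caption index, Z = image index."""
    return [
        f"{x}_{y}_{z}.png"
        for x in range(num_pairs)
        for y in (0, 1)
        for z in (0, 1)
    ]


def missing_images(image_folder: str, num_pairs: int = 924) -> List[str]:
    """Return the expected filenames that are not present in image_folder."""
    folder = Path(image_folder)
    return [
        name for name in expected_filenames(num_pairs)
        if not (folder / name).is_file()
    ]


if __name__ == "__main__":
    import sys

    missing = missing_images(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{len(missing)} of {len(expected_filenames())} images missing")
```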

### Evaluation

First, download the model [OpenGVLab/InternVL2_5-26B from Hugging Face](https://huggingface.co/OpenGVLab/InternVL2_5-26B) and change [line 97 of evaluate_images.py](evaluate_images.py#97) to the local path of InternVL2_5-26B. You can then compute the semantic consistency score between each generated image and its text prompt with the following command:

```bash
python evaluate_images.py --tgtpath <JSON_PATH> --image_path <IMAGE_FOLDER>
```

This writes a JSONL file to `<JSON_PATH>` that stores the semantic consistency score for each image. Then run

```bash
python summary_scores.py --tgtpath <JSON_PATH>
```

to get the score for each sub-task and the average PairComp score.

### Leaderboard

The following are PairComp evaluation results for state-of-the-art text-to-image models, ranked by geometric mean.

| Rank |                          Model                          | Model size | Arithmetic mean | Geometric mean |
| :--: | :-----------------------------------------------------: | :--------: | :-------------: | :------------: |
|  🏅️   |   [Janus-FocusDiff-7B](https://arxiv.org/abs/2506.05501)    |     7B       |      85.0       |      83.5      |
|  🥈   |         [SD3-Medium](https://arxiv.org/abs/2403.03206)         |      2B      |      84.4       |      81.4      |
|  🥉   |      [Sana-1.5](https://arxiv.org/abs/2501.18427)       |      4.8B      |      83.2       |      80.0      |
|  4   |       [T2I-R1](https://arxiv.org/abs/2505.00703)        |      7B      |      82.4       |      79.3      |
|  5   |    [Janus-Pro-R1](https://arxiv.org/abs/2506.01480)     |      7B      |      82.0       |      79.2      |
|  6   | [FLUX.1-dev](https://github.com/black-forest-labs/flux) |      12B      |      80.3       |      75.7      |
|  7   |       [BLIP3-o](https://arxiv.org/abs/2505.09568)       |      8B      |      79.3       |      75.5      |
|  8   |      [Infinity](https://arxiv.org/abs/2412.04431)       |      8B      |      77.0       |      72.7      |
|  9   |       [SEED-X](https://arxiv.org/abs/2404.14396)        |      17B      |      74.8       |      71.5      |
|  10  |    [Janus-Pro-7B](https://arxiv.org/abs/2501.17811)     |       7B     |      75.5       |      70.4      |
|  11  |  [Janus-FocusDiff-1B](https://arxiv.org/abs/2506.05501) |      1B      |      71.0       |      68.1      |
|  12  |        [Emu3](https://arxiv.org/abs/2409.18869)         |       8B     |      68.5       |      63.2      |
|  13  |    [PixArt-alpha](https://arxiv.org/abs/2310.00426)     |      0.6B      |      67.5       |      62.7      |
|  14  |    [Janus-Pro-1B](https://arxiv.org/abs/2501.17811)     |     1B       |      64.6       |      59.2      |
|  15  |       [Show-o](https://arxiv.org/abs/2408.12528)        |     1.3B       |      63.6       |      59.1      |
|  16  |       [VILA-U](https://arxiv.org/abs/2409.04429)        |      7B      |      62.9       |      58.0      |
|  17  |     [Janus-Flow](https://arxiv.org/abs/2411.07975)      |     1.3B       |      55.5       |      49.0      |
|  18  |     [VARGPTv1.1](https://arxiv.org/abs/2504.02949)      |      7B      |      53.6       |      48.3      |
|  19  |      [LLamaGen](https://arxiv.org/abs/2406.06525)       |     775M       |      49.1       |      42.3      |
