# Evaluation Datasets

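We follow the model evaluation protocol of WeCLIP and measure model performance with image-text cross-modal retrieval and zero-shot classification.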
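
For readers unfamiliar with the retrieval metric reported below (R@1, R@5, R@10), here is a minimal sketch of Recall@K on a toy similarity matrix. It is not the code in `evaluate/eval_retrieval.py`; the function name `recall_at_k` and the toy data are illustrative assumptions.

```python
def recall_at_k(similarity, relevant, k):
    """Fraction of queries with at least one relevant candidate in the top k.

    similarity[q][c] is the score between query q and candidate c;
    relevant[q] is the set of candidate indices that match query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity and keep the top k.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if relevant[q] & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example (hypothetical scores): 3 text queries against 4 images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: highest score is image 0
    [0.2, 0.4, 0.8, 0.1],  # query 1: highest score is image 2
    [0.5, 0.6, 0.1, 0.3],  # query 2: highest score is image 1
]
relevant = [{0}, {1}, {0}]  # ground-truth matches per query

r1 = recall_at_k(similarity, relevant, 1)
r2 = recall_at_k(similarity, relevant, 2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

Only query 0 has its match ranked first, while all three queries have a match within the top two, so recall rises as K grows.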

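For the Chinese evaluation, we preprocess the evaluation datasets directly with the preprocessing code from WeCLIP.
The preprocessed datasets are available at the links below. After downloading, they can be used for evaluation as-is, with no further preprocessing.
* Flickr30K-CN data (general retrieval): [download link](https://drive.weixin.qq.com/s?k=AJEAIQdfAAoU6BPeQK)
* COCO-CN data (general retrieval): [download link](https://drive.weixin.qq.com/s?k=AJEAIQdfAAozxsd5wx)
* MUGE data (e-commerce retrieval): [download link](https://drive.weixin.qq.com/s?k=AJEAIQdfAAoV1LU6Tc)
* ELEVATER data (zero-shot classification): [download link](https://drive.weixin.qq.com/s?k=AJEAIQdfAAo0beA5kC)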

After downloading the preprocessed datasets above, run the evaluation scripts:
```shell
# Retrieval evaluation, using Flickr30k-CN as an example
python evaluate/eval_retrieval.py --model_name YouCLIP-Base --model_checkpoint $CHECKPOINT_PATH \
  --text_info_path $DIR_Retrival/Flickr30k-CN/meta_weclip/test_text_info.jsonl \
  --img_info_path $DIR_Retrival/Flickr30k-CN/meta_weclip/test_image_info.jsonl \
  --img_tsv_path $DIR_Retrival/Flickr30k-CN/test_imgs.tsv \
  --relation_info_path $DIR_Retrival/Flickr30k-CN/meta_weclip/test_relation_info.jsonl

# Zero-shot classification evaluation
python evaluate/eval_zs_cls.py --model_name YouCLIP-Base --model_checkpoint $CHECKPOINT_PATH --elevater_path $DIR_Elevater
```

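The arguments are:
* `model_name` is the model name. The available options are `'YouCLIP-Base', 'YouCLIP-Base-CN-ENG', 'YouCLIP-Base-512', 'YouCLIP-Base-512-CN-ENG', 'YouCLIP-Large', 'YouCLIP-Large-CN-ENG', 'YouCLIP-Huge', 'YouCLIP-Huge-CN-ENG'`.
* `$CHECKPOINT_PATH` is the path to the downloaded model file.
* `$DIR_Retrival` is the path to the folder containing the downloaded retrieval datasets.
* `$DIR_Elevater` is the path to the downloaded ELEVATER data.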
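
When scripting several runs, the retrieval command can be assembled programmatically. The sketch below only rebuilds the Flickr30k-CN command shown above from its parts. The helper name `build_retrieval_command` and the `/path/to/...` placeholders are hypothetical, and whether other datasets follow the same folder layout is an assumption to check against the downloaded files.

```python
import shlex


def build_retrieval_command(model_name, checkpoint_path, retrieval_dir,
                            dataset="Flickr30k-CN"):
    """Return the eval_retrieval.py argument list for one dataset.

    Assumes the layout used in the Flickr30k-CN example:
    <retrieval_dir>/<dataset>/meta_weclip/*.jsonl and
    <retrieval_dir>/<dataset>/test_imgs.tsv.
    """
    meta = f"{retrieval_dir}/{dataset}/meta_weclip"
    return [
        "python", "evaluate/eval_retrieval.py",
        "--model_name", model_name,
        "--model_checkpoint", checkpoint_path,
        "--text_info_path", f"{meta}/test_text_info.jsonl",
        "--img_info_path", f"{meta}/test_image_info.jsonl",
        "--img_tsv_path", f"{retrieval_dir}/{dataset}/test_imgs.tsv",
        "--relation_info_path", f"{meta}/test_relation_info.jsonl",
    ]


# Placeholder paths; replace them with the real checkpoint and dataset locations.
cmd = build_retrieval_command("YouCLIP-Base", "/path/to/checkpoint", "/path/to/retrieval")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the evaluation.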

Only the preprocessed evaluation datasets are provided here. If you need the complete original datasets, see the [WeCLIP evaluation guide](https://git.woa.com/mmvision/image/weclip/blob/master/EVAL.md).

For English tasks, we follow prior CLIP work and evaluate with zero-shot top-1 accuracy on ImageNet-1K.

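As a rough illustration of zero-shot top-1 accuracy, the sketch below assigns each image to the class whose text embedding scores highest and counts how often that matches the label. It is a toy example rather than the code in `evaluate/eval_zs_cls.py`; the function name `zero_shot_top1` and the two-dimensional embeddings are made up for readability.

```python
def zero_shot_top1(image_features, class_features, labels):
    """Top-1 accuracy when each image is assigned its highest-scoring class."""
    correct = 0
    for feat, label in zip(image_features, labels):
        # Dot product of the image embedding with every class text embedding.
        # With L2-normalized features this equals cosine similarity.
        scores = [sum(a * b for a, b in zip(feat, cls)) for cls in class_features]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy embeddings (hypothetical, not normalized): 2 classes, 4 images.
class_features = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 1, 1]

accuracy = zero_shot_top1(image_features, class_features, labels)
print(f"top-1 accuracy = {accuracy:.2f}")
```

The third image is closer to class 0 but labeled 1, so three of the four predictions are correct.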

# Detailed Evaluation Results

The table below compares YouCLIP with WeCLIP on Chinese retrieval and English capability:



<table style="width:100%; text-align:center;">
  <style>
    .light-blue-background {
      background-color: #e0f7fa;
    }
    th {
      text-align: center;
      vertical-align: middle;
    }
  </style>
  <thead>
    <tr align="center">
      <th rowspan="2" style="text-align:center; vertical-align:middle;"> </th>
      <th colspan="6" style="text-align:center; vertical-align:middle;">COCO-CN Retrieval</th>
      <th colspan="6" style="text-align:center; vertical-align:middle;">Flickr30K-CN Retrieval</th>
      <th colspan="6" style="text-align:center; vertical-align:middle;">MUGE Retrieval</th>
      <th rowspan="2" style="text-align:center; vertical-align:middle;">English capability<br>ImageNet-1K</th>
    </tr>
    <tr align="center">
      <th colspan="3" style="text-align:center; vertical-align:middle;">image-to-text</th>
      <th colspan="3" style="text-align:center; vertical-align:middle;">text-to-image</th>
      <th colspan="3" style="text-align:center; vertical-align:middle;">image-to-text</th>
      <th colspan="3" style="text-align:center; vertical-align:middle;">text-to-image</th>
      <th colspan="3" style="text-align:center; vertical-align:middle;">image-to-text</th>
      <th colspan="3" style="text-align:center; vertical-align:middle;">text-to-image</th>
    </tr>
    <tr>
      <th></th>
      <th style="text-align:center; vertical-align:middle;">R@1</th>
      <th style="text-align:center; vertical-align:middle;">R@5</th>
      <th style="text-align:center; vertical-align:middle;">R@10</th>
      <th style="text-align:center; vertical-align:middle;">R@1</th>
      <th style="text-align:center; vertical-align:middle;">R@5</th>
      <th style="text-align:center; vertical-align:middle;">R@10</th>
      <th style="text-align:center; vertical-align:middle;">R@1</th>
      <th style="text-align:center; vertical-align:middle;">R@5</th>
      <th style="text-align:center; vertical-align:middle;">R@10</th>
      <th style="text-align:center; vertical-align:middle;">R@1</th>
      <th style="text-align:center; vertical-align:middle;">R@5</th>
      <th style="text-align:center; vertical-align:middle;">R@10</th>
      <th style="text-align:center; vertical-align:middle;">R@1</th>
      <th style="text-align:center; vertical-align:middle;">R@5</th>
      <th style="text-align:center; vertical-align:middle;">R@10</th>
      <th style="text-align:center; vertical-align:middle;">R@1</th>
      <th style="text-align:center; vertical-align:middle;">R@5</th>
      <th style="text-align:center; vertical-align:middle;">R@10</th>
      <th style="text-align:center; vertical-align:middle;">Top 1 ACC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
    <th>Base</th>
    </tr>
    <tr align="center">
      <td><i>WeCLIP-Base</i></td>
      <td>65.5</td>
      <td>89.4</td>
      <td>95.2</td>
      <td>66.6</td>
      <td>89.6</td>
      <td>95.6</td>
      <td>85.7</td>
      <td>98.2</td>
      <td>99.6</td>
      <td>71.3</td>
      <td>91.3</td>
      <td>95.4</td>
      <td>40.0</td>
      <td>68.1</td>
      <td>77.4</td>
      <td>52.8</td>
      <td>76.8</td>
      <td>84.7</td>
      <td>----</td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td>YouCLIP-Base-CN-ENG</td>
      <td>70.1</td>
      <td>92.1</td>
      <td><strong>97.4</strong></td>
      <td><strong>68.2</strong></td>
      <td>90.1</td>
      <td>95.9</td>
      <td><strong>88.9</strong></td>
      <td>98.8</td>
      <td>99.5</td>
      <td>71.0</td>
      <td>91.5</td>
      <td>95.2</td>
      <td>48.4</td>
      <td>77.0</td>
      <td>85.2</td>
      <td>57.0</td>
      <td>81.3</td>
      <td>87.6</td>
      <td><strong>71.52</strong></td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td><strong>YouCLIP-Base</strong></td>
      <td><strong>70.5</strong></td>
      <td><strong>92.4</strong></td>
      <td>97.2</td>
      <td>67.4</td>
      <td><strong>90.7</strong></td>
      <td><strong>96.2</strong></td>
      <td>88.7</td>
      <td><strong>98.9</strong></td>
      <td><strong>99.8</strong></td>
      <td><strong>71.8</strong></td>
      <td><strong>92.1</strong></td>
      <td><strong>95.5</strong></td>
      <td><strong>53.2</strong></td>
      <td><strong>81.7</strong></td>
      <td><strong>89.0</strong></td>
      <td><strong>61.9</strong></td>
      <td><strong>85.3</strong></td>
      <td><strong>91.3</strong></td>
      <td>----</td>
    </tr>
    <tr>
    <th>Large</th>
    </tr>
    <tr align="center">
      <td><i>WeCLIP-Large</i></td>
      <td>66.9</td>
      <td>89.9</td>
      <td>96.2</td>
      <td>66.6</td>
      <td>89.6</td>
      <td>95.7</td>
      <td>88.7</td>
      <td>98.6</td>
      <td>99.7</td>
      <td>75.1</td>
      <td>92.6</td>
      <td>96.4</td>
      <td>45.9</td>
      <td>73.2</td>
      <td>81.8</td>
      <td>59.7</td>
      <td>82.0</td>
      <td>88.3</td>
      <td>----</td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td>YouCLIP-Base-512-CN-ENG</td>
      <td>70.7</td>
      <td>92.8</td>
      <td>96.8</td>
      <td>69.1</td>
      <td>91.0</td>
      <td>95.7</td>
      <td>92.8</td>
      <td>98.9</td>
      <td>99.8</td>
      <td><strong>77.7</strong></td>
      <td>94.2</td>
      <td><strong>97.1</strong></td>
      <td>53.8</td>
      <td>82.8</td>
      <td>89.9</td>
      <td>61.8</td>
      <td>85.7</td>
      <td>92.1</td>
      <td><strong>75.18</strong></td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td>YouCLIP-Base-512</td>
      <td>72.5</td>
      <td>92.8</td>
      <td>97.4</td>
      <td>69.0</td>
      <td><strong>91.4</strong></td>
      <td>96.6</td>
      <td><strong>93.5</strong></td>
      <td>99.0</td>
      <td>99.6</td>
      <td>77.3</td>
      <td><strong>94.2</strong></td>
      <td>97.0</td>
      <td>52.4</td>
      <td>81.3</td>
      <td>88.7</td>
      <td>61.0</td>
      <td>84.4</td>
      <td>90.7</td>
      <td>----</td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td>YouCLIP-Large-CN-ENG</td>
      <td>73.1</td>
      <td>93.3</td>
      <td>97.4</td>
      <td>70.1</td>
      <td>90.6</td>
      <td><strong>96.7</strong></td>
      <td>90.8</td>
      <td>99.3</td>
      <td>99.7</td>
      <td>76.8</td>
      <td>93.4</td>
      <td>96.6</td>
      <td>54.1</td>
      <td>82.5</td>
      <td>89.5</td>
      <td>62.3</td>
      <td>85.2</td>
      <td>91.4</td>
      <td><strong>76.9</strong></td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td><strong>YouCLIP-Large</strong></td>
      <td><strong>73.1</strong></td>
      <td><strong>93.4</strong></td>
      <td><strong>97.5</strong></td>
      <td><strong>70.7</strong></td>
      <td>90.9</td>
      <td>96.5</td>
      <td>92.1</td>
      <td><strong>99.4</strong></td>
      <td><strong>99.9</strong></td>
      <td>77.2</td>
      <td>93.6</td>
      <td>96.6</td>
      <td><strong>56.3</strong></td>
      <td><strong>84.6</strong></td>
      <td><strong>91.1</strong></td>
      <td><strong>64.7</strong></td>
      <td><strong>87.1</strong></td>
      <td><strong>92.9</strong></td>
      <td>----</td>
    </tr>
    <tr align="center">
    <th>Huge</th>
    </tr>
    <tr align="center">
      <td><i>WeCLIP-Huge</i></td>
      <td>65.0</td>
      <td>90.6</td>
      <td>96.5</td>
      <td>70.1</td>
      <td>92.8</td>
      <td>97.4</td>
      <td>88.5</td>
      <td>98.9</td>
      <td>99.5</td>
      <td>76.8</td>
      <td>93.9</td>
      <td>96.6</td>
      <td>50.5</td>
      <td>77.0</td>
      <td>84.5</td>
      <td>64.8</td>
      <td>85.3</td>
      <td>90.2</td>
      <td>----</td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td>YouCLIP-Huge-CN-ENG</td>
      <td>72.0</td>
      <td>93.1</td>
      <td>97.7</td>
      <td>71.2</td>
      <td>92.5</td>
      <td>97.3</td>
      <td>95.2</td>
      <td>99.5</td>
      <td>99.9</td>
      <td>81.3</td>
      <td>96.1</td>
      <td>97.9</td>
      <td>58.1</td>
      <td>86.2</td>
      <td>92.6</td>
      <td>66.5</td>
      <td>88.9</td>
      <td><strong>93.9</strong></td>
      <td><strong>80.94</strong></td>
    </tr>
    <tr align="center" class="light-blue-background">
      <td><strong>YouCLIP-Huge</strong></td>
      <td><strong>73.1</strong></td>
      <td><strong>94.0</strong></td>
      <td><strong>98.1</strong></td>
      <td><strong>71.9</strong></td>
      <td><strong>92.9</strong></td>
      <td><strong>97.4</strong></td>
      <td><strong>95.9</strong></td>
      <td><strong>99.7</strong></td>
      <td><strong>99.9</strong></td>
      <td><strong>82.1</strong></td>
      <td><strong>96.3</strong></td>
      <td><strong>98.0</strong></td>
      <td><strong>58.6</strong></td>
      <td><strong>86.7</strong></td>
      <td><strong>92.7</strong></td>
      <td><strong>68.0</strong></td>
      <td><strong>89.1</strong></td>
      <td>93.7</td>
      <td>----</td>
    </tr>
  </tbody>
</table>


Zero-shot classification on the ELEVATER benchmark:

| Subset            | Metric          | WeCLIP-Base | YouCLIP-Base-CN-ENG | YouCLIP-Base | WeCLIP-Large | YouCLIP-Base-512-CN-ENG | YouCLIP-Base-512 | YouCLIP-Large-CN-ENG | YouCLIP-Large | WeCLIP-Huge | YouCLIP-Huge-CN-ENG | YouCLIP-Huge |
|:------------------:|:-----------------:|:-------------:|:---------------------:|:------------:|:--------------:|:--------------------------:|:------------------:|:-----------------------:|:---------------:|:-------------:|:-------------------:|:------------:|
| Average           | -               | 59.4        | 59.3                |     61.2     | 63.2         | 62.6                     | 60.4             | 63.6                  | 64.1          | 68.5        |        66.5         |     65.9     |
| CIFAR-10          | Accuracy        | 97.4        | 92.9                |     92.8     | 98.9         | 93.0                     | 92.9             | 96.6                  | 96.5          | 98.7        |        97.1         |     97.0     |
| CIFAR-100         | Accuracy        | 84.5        | 72.8                |     73.7     | 90           | 73.3                     | 73.6             | 81.4                  | 82.0          | 89          |        82.7         |     82.9     |
| DTD               | Accuracy        | 52.1        | 55.1                |     61.5     | 51.8         | 64.5                     | 58.8             | 57.2                  | 61.0          | 59.5        |        64.1         |     60.1     |
| EuroSAT           | Accuracy        | 61.4        | 48.6                |     44.2     | 63.2         | 46.7                     | 42.6             | 55.0                  | 53.3          | 62          |        58.5         |     63.5     |
| FER               | Accuracy        | 50.8        | 42.8                |     49.6     | 51.2         | 48.7                     | 41.4             | 52.5                  | 51.6          | 51.1        |        53.6         |    51.93     |
| FGVC              | Mean-per-class  | 13.8        | 42.7                |     49.6     | 18.5         | 48.6                     | 54.1             | 59.3                  | 57.8          | 40.1        |        67.5         |    57.94     |
| KITTI             | Accuracy        | 32.9        | 31.8                |     37.6     | 31.4         | 36.0                     | 37.1             | 37.3                  | 36.1          | 25.7        |        29.7         |     21.4     |
| MNIST             | Accuracy        | 60.9        | 83.2                |     84.4     | 72.1         | 84.4                     | 83.1             | 89.1                  | 88.9          | 89.9        |        91.2         |     91.3     |
| PC                | Accuracy        | 50          | 50.0                |     50.0     | 50           | 50.0                     | 50.0             | 50.0                  | 50.0          | 50          |        50.0         |     50.0     |
| VOC               | 11-point mAP    | 81.3        | 22.9                |     22.9     | 81.1         | 22.9                     | 22.9             | 22.8                  | 22.8          | 81.2        |        22.8         |     22.8     |
| Caltech-101       | Mean-per-class  | 91.7        | 93.8                |     90.5     | 93.3         | 94.8                     | 92.3             | 91.9                  | 91.8          | 94          |        93.7         |     94.9     |
| Country-211       | Accuracy        | 19.7        | 15.8                |     17.8     | 26.2         | 21.6                     | 20.5             | 21.8                  | 23.6          | 33.4        |        37.6         |     37.6     |
| Food-101          | Accuracy        | 75.8        | 84.8                |     85.0     | 83.1         | 90.9                     | 88.8             | 90.2                  | 91.1          | 88.3        |        92.9         |     92.6     |
| GTSRB             | Accuracy        | 33.4        | 42.7                |     56.0     | 40.1         | 51.2                     | 44.6             | 48.9                  | 50.7          | 50          |        56.8         |     56.3     |
| Hateful-Memes     | ROC AUC         | 53.4        | 57.7                |     58.4     | 51.6         | 55.3                     | 52.9             | 56.8                  | 52.6          | 53.2        |        54.3         |     55.3     |
| Oxford Flowers    | Mean-per-class  | 70.9        | 48.9                |     47.4     | 80.4         | 65.0                     | 45.9             | 44.6                  | 56.9          | 83.2        |        57.5         |     58.5     |
| Oxford-IIIT Pets  | Mean-per-class  | 74.3        | 90.1                |     89.0     | 83           | 91.3                     | 91.8             | 92.5                  | 94.7          | 94.1        |        93.2         |     94.4     |
| Rendered-SST2     | Accuracy        | 58.2        | 55.0                |     57.5     | 59.4         | 51.6                     | 57.6             | 55.3                  | 55.1          | 65          |        54.9         |     62.1     |
| RESISC-45         | Accuracy        | 65.1        | 64.1                |     64.8     | 68           | 69.0                     | 64.3             | 74.5                  | 71.4          | 71.5        |        76.0         |     73.0     |
| Stanford-Cars     | Accuracy        | 60.6        | 90.4                |     90.7     | 71.5         | 93.5                     | 91.9             | 94.1                  | 94.1          | 89.7        |        95.4         |     94.9     |


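The `Average` row appears to be the unweighted mean of the 20 subset scores in each column. As a quick check under that assumption, the sketch below recomputes it for the WeCLIP-Base column, using the values copied from the table above.

```python
# WeCLIP-Base scores copied from the ELEVATER table above.
weclip_base = {
    "CIFAR-10": 97.4, "CIFAR-100": 84.5, "DTD": 52.1, "EuroSAT": 61.4,
    "FER": 50.8, "FGVC": 13.8, "KITTI": 32.9, "MNIST": 60.9,
    "PC": 50.0, "VOC": 81.3, "Caltech-101": 91.7, "Country-211": 19.7,
    "Food-101": 75.8, "GTSRB": 33.4, "Hateful-Memes": 53.4,
    "Oxford Flowers": 70.9, "Oxford-IIIT Pets": 74.3, "Rendered-SST2": 58.2,
    "RESISC-45": 65.1, "Stanford-Cars": 60.6,
}

# Unweighted mean over all subsets, rounded to one decimal like the table.
average = round(sum(weclip_base.values()) / len(weclip_base), 1)
print(average)
```

The result matches the 59.4 reported in the `Average` row for WeCLIP-Base. Note that the subsets use different metrics (accuracy, mean-per-class accuracy, 11-point mAP, ROC AUC), so the mean is a summary across heterogeneous scores.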