# ZERO: A Large-scale Chinese Cross-modal Benchmark with a New Vision-Language Framework

This repository contains the PyTorch implementation of our method.


## Requirements:
- Install the following packages:
    - python 3.8.12
    - pytorch 1.11.0
    - transformers 4.15.0
    - timm 0.4.12

- Run:
    <pre>pip install -r requirements.txt</pre>
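
After installation, it can be useful to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of this repository: the `check_versions` helper is a hypothetical name, and it assumes the packages were installed with pip, where PyTorch is distributed under the name `torch`.

```python
from importlib import metadata

# Versions copied from the Requirements list; "torch" is the pip name for PyTorch.
REQUIRED = {
    "torch": "1.11.0",
    "transformers": "4.15.0",
    "timm": "0.4.12",
}


def check_versions(required, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package.

    Hypothetical helper for illustration; `found` is None when not installed.
    """
    problems = {}
    for name, expected in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # A local build tag such as "1.11.0+cu113" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(REQUIRED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The script prints nothing when every package matches; otherwise it prints one line per mismatch.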


## Inference demo:
- Image-text retrieval


## Finetuned checkpoints:
Task | R2D2 (ZERO-Corpus) w/ ViT-B 
--- | :---: 
Image-Text Retrieval (Flickr30k-cn) | <a href="https://drive.google.com/drive/folders/1xlPmdtw3T1H4x1glfvJH6KqIfXZSUkc6?usp=sharing">Download</a>
Image-Text Retrieval (COCO-cn) | <a href="https://drive.google.com/drive/folders/1nbnrw4Ns2v3lktFSeyewgMUBIq0M_VUi?usp=sharing">Download</a>


## Evaluation on Image-Text Retrieval Task:
- Download the public Flickr30k-cn and COCO-cn datasets from their original websites, and set `image_root` in `configs/retrieval_{dataset}.yaml` accordingly.

- To evaluate the finetuned R2D2 model on Flickr30k-cn, run:
    <pre>python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 eval_r2d2_retrieval.py \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/flickr --checkpoint checkpoints/flickr30-cn-finetune/checkpoint_best.pth \
    --evaluate</pre> 

- To evaluate the finetuned R2D2 model on COCO-cn, run:
    <pre>python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 eval_r2d2_retrieval.py \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/coco --checkpoint checkpoints/coco-finetune/checkpoint_best.pth \
    --evaluate</pre> 
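
Image-text retrieval is commonly scored with Recall@K: the fraction of queries for which at least one correct match appears among the top K ranked candidates. Whether `eval_r2d2_retrieval.py` reports exactly this metric is an assumption. The sketch below is an illustration in plain Python, not the repository's implementation, and the `recall_at_k` function and the toy similarity matrix are hypothetical.

```python
def recall_at_k(sim, gt, k):
    """Fraction of queries whose top-k candidates contain a ground-truth match.

    sim: list of rows, sim[q][c] is the similarity of query q and candidate c
    gt:  list of sets, gt[q] holds the candidate indices that match query q
    """
    hits = 0
    for scores, positives in zip(sim, gt):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if positives & set(ranked[:k]):
            hits += 1
    return hits / len(sim)


# Toy example: 3 text queries scored against 3 images (illustrative values only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
gt = [{0}, {1}, {1}]

print(recall_at_k(sim, gt, 1))
print(recall_at_k(sim, gt, 2))
```

In this toy case the second query ranks its correct image second, so it counts as a miss at K=1 and a hit at K=2.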


## LICENSE
### Dataset

- We pre-train on our proposed ZERO-Corpus dataset and run the downstream tasks on six public cross-modal datasets and our five proposed downstream datasets.

- The public cross-modal datasets include Flickr30k-CN, COCO-CN, AIC-ICC, ECommerce-T2I, and MUGE.

- As stated by their respective proposers, we follow the MIT License for Flickr30k-CN, COCO-CN, ECommerce-T2I, and MUGE.

- As stated by its proposer, we follow the Apache License for AIC-ICC.

- Researchers using our proposed pre-training dataset and downstream datasets should follow the Apache License.


### Code

- Researchers using our code should follow the Apache License.