Requirements
pytorch 1.10.2
transformers 4.8.1
timm 0.4.9
bert_score 0.3.11
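
To confirm that the installed packages match the versions listed above, a small check along these lines may help. It is only a sketch and is not part of this repository. It assumes Python 3.8+ (for importlib.metadata) and that the PyPI distribution names are "torch" for pytorch and "bert-score" for bert_score. The function name check_versions is hypothetical.

```python
from importlib import metadata

# Pinned versions copied from the Requirements list.
# Assumption: distribution names on PyPI are "torch" and "bert-score".
REQUIRED = {
    "torch": "1.10.2",
    "transformers": "4.8.1",
    "timm": "0.4.9",
    "bert-score": "0.3.11",
}


def check_versions(required=REQUIRED):
    """Return (name, wanted, found) for every missing or mismatched package."""
    problems = []
    for name, wanted in required.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local builds may carry a suffix such as "+cu113"; compare the base version only.
        if found is None or found.split("+")[0] != wanted:
            problems.append((name, wanted, found))
    return problems


if __name__ == "__main__":
    problems = check_versions()
    for name, wanted, found in problems:
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
    if not problems:
        print("All pinned requirements are satisfied.")
```

An empty result means every pinned package is installed at the listed version.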

Prepare datasets and models
Download the Flickr30k (https://shannon.cs.illinois.edu/DenotationGraph/) and MSCOCO (https://cocodataset.org/#home) datasets and put them into ./Dataset. The annotations are provided in ./data_annotation/.

The checkpoints of the fine-tuned VLP models are available from CLIP (https://huggingface.co/openai/clip-vit-base-patch16). Download them and put them into ./checkpoint.
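
Before running the evaluation, it may be worth checking that the directories mentioned above exist. The sketch below is not part of this repository. It only tests for the three top-level directories named in this section and assumes nothing about the files inside them. The helper name missing_dirs is hypothetical.

```python
from pathlib import Path

# Directories named in this section. Their internal layout is not specified
# here, so only their existence is checked.
EXPECTED_DIRS = ("Dataset", "data_annotation", "checkpoint")


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directory names that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories: " + ", ".join("./" + name for name in missing))
    else:
        print("All expected directories are present.")
```

Run it from the repository root so that the relative paths resolve correctly.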


Run and test
python eval_clip-vit2clip-cnn.py
