Requirements:
pytorch 1.10.2
transformers 4.8.1
timm 0.4.9
bert_score 0.3.11
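The optional script below is a minimal sketch for checking that the installed packages match these versions. It assumes Python 3.8+ (for importlib.metadata) and that the pip distribution names are torch (for pytorch) and bert-score (for bert_score). The names check_versions and REQUIRED are hypothetical and are not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed under "Requirements", keyed by pip distribution name.
# Assumption: "torch" is the pip name for pytorch and "bert-score" for bert_score.
REQUIRED = {
    "torch": "1.10.2",
    "transformers": "4.8.1",
    "timm": "0.4.9",
    "bert-score": "0.3.11",
}


def check_versions(required):
    """Return {name: (expected, found)} for every package that does not match.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in required.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # torch wheels may carry a local suffix such as "1.10.2+cu113";
        # compare only the public version part.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(REQUIRED)
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All requirements match.")
```

Newer patch releases may also work; the script only reports exact mismatches against the versions listed above.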

Prepare datasets and models:
Download the Flickr30k (https://shannon.cs.illinois.edu/DenotationGraph/) and MSCOCO (https://cocodataset.org/#home) datasets and put them into ./Dataset. The annotations are provided in ./data_annotation/.

Download the checkpoints of the fine-tuned VLP models from CLIP (https://huggingface.co/openai/clip-vit-base-patch16), ALBEF (https://github.com/salesforce/ALBEF), and TCL (https://github.com/uta-smile/TCL), and put them into ./checkpoint.
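Before running the evaluation, a quick check of the directory layout can catch a missed download. The sketch below is hypothetical (missing_dirs is not part of this repository): it only checks that ./Dataset, ./data_annotation, and ./checkpoint exist and are non-empty, assuming it is run from the repository root. It does not know the expected sub-folder or file names inside each directory.

```python
from pathlib import Path

# Directories named in this README. Their internal structure is not
# specified here, so this only checks that each exists and is non-empty.
EXPECTED_DIRS = ["Dataset", "data_annotation", "checkpoint"]


def missing_dirs(root="."):
    """Return the expected directories under `root` that are absent or empty."""
    root = Path(root)
    missing = []
    for name in EXPECTED_DIRS:
        path = root / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_dirs()
    if gaps:
        print("Missing or empty:", ", ".join(gaps))
    else:
        print("Dataset, annotation, and checkpoint directories are in place.")
```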


Run and evaluate:
python SEA_eval.py
