# Code Instructions

## Requirements
```
conda create -n demovlp python=3.8
conda activate demovlp
pip install -r requirements
```

## Data
### Download pre-trained models
- ViT checkpoint
```
mkdir pretrained
cd pretrained
wget -c https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth
```
- distilbert-base-uncased  
```
# Run from the repository root
mkdir -p pretrained/distilbert-base-uncased
```
Download all files from the Hugging Face [distilbert-base-uncased][2] repository and put them in `pretrained/distilbert-base-uncased`.
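
The `-80ecf9dd` suffix in the ViT checkpoint file name looks like the `torch.hub` naming convention, in which the suffix is the leading hex digits of the file's SHA256 hash. Assuming that convention holds for this file, a minimal sketch for checking that the download is intact could look like this (the helper name `hash_suffix_matches` is made up for illustration):

```python
import hashlib
import re
from pathlib import Path


def hash_suffix_matches(path):
    """Return True if the file's SHA256 starts with the hex suffix in its name."""
    path = Path(path)
    # Assumed naming pattern: <name>-<hex prefix>.<ext>
    match = re.search(r"-([0-9a-f]{8,})$", path.stem)
    if match is None:
        raise ValueError(f"no hash suffix found in file name: {path.name}")
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so a large checkpoint never has to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(match.group(1))


if __name__ == "__main__":
    ckpt = Path("pretrained/jx_vit_base_p16_224-80ecf9dd.pth")
    if ckpt.exists():
        print("checksum ok" if hash_suffix_matches(ckpt) else "checksum mismatch")
```

A mismatch usually means the download was interrupted; `wget -c` can resume it.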

### Pre-train datasets
- WebVid  
Refer to Github repo [WebVid](https://github.com/m-bain/webvid).

- CC3M  
Refer to [Conceptual Captions Website](https://ai.google.com/research/ConceptualCaptions/download).

> Note: `meta_data/webvid_validation_success_full.tsv` and `meta_data/cc3m_validation_success_full.tsv` are used to load videos and captions. The train split metadata files are organized in the same way, but they are too large to include in this repo.
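
To inspect one of these metadata files before training, a small reader along the following lines may help. This is only a sketch: it assumes the file is tab-separated with a header row, and it does not assume any particular column names, since they are not listed here.

```python
import csv


def read_metadata(tsv_path, limit=None):
    """Read rows of a tab-separated metadata file as dictionaries."""
    rows = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # Assumption: the first line is a header that names the columns.
        reader = csv.DictReader(f, delimiter="\t")
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return rows


if __name__ == "__main__":
    import os

    path = "meta_data/webvid_validation_success_full.tsv"
    if os.path.exists(path):
        for row in read_metadata(path, limit=3):
            print(row)
```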

### Downstream datasets
- MSRVTT
```
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
```

### Extract region features
We adopt [bottom-up-attention][6] to extract region features for all datasets.  
To save time and storage, we uniformly sample 8 frames from each video in WebVid and extract region features for each sampled frame.  
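
Uniform sampling can be implemented in several ways. The sketch below shows one common variant, which splits the video into equal segments and takes the middle frame of each. It is an illustration, not necessarily the exact rule used to produce the released features, and the function name is hypothetical.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Pick num_samples frame indices spread evenly over a video."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Split the video into equal segments and take the middle frame of each.
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * (i + 0.5))) for i in range(num_samples)]


if __name__ == "__main__":
    # An 80-frame video yields one index from the middle of every 10-frame segment.
    print(uniform_frame_indices(80))
```

Videos shorter than the number of samples simply repeat some indices, so every video still yields 8 frames.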

To help organize these datasets, here is a snapshot of how we structure the data folders:
- WebVid  
```
|--WebVid
|----train
|--------000001_000050
|------------1066674784.mp4
|------------...
|------------1066724161.mp4
|--------...
|--------199951_200000
|----val
|----region_features_all
|--------train
|------------000001_000050
|----------------1066674784
|--------------------1.npz
|--------------------...
|--------------------8.npz
|------------...
|--------val
```
The data folders for downstream tasks follow the same structure as WebVid.  

- CC3M
```
|--CC3M
|----training
|----validation
|--------0_1595581236
|--------...
|--------9999_352904708
|----region_features_all
|--------train
|--------val
|------------0000
|----------------0_1595581236_1.npz
|----------------...
|----------------998_856795266_1.npz
|------------0015
```
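
The two layouts above map a video or image to its region-feature files in slightly different ways. As a sketch, the helpers below build the expected paths so that missing features can be spotted before training. The function names are hypothetical, and the shard folder (e.g., `000001_000050` or `0000`) is passed in explicitly because the rule that assigns items to shards is not described here.

```python
from pathlib import Path


def webvid_feature_paths(root, split, shard, video_id, num_frames=8):
    """List the region-feature files expected for one WebVid video."""
    video_dir = Path(root) / "region_features_all" / split / shard / str(video_id)
    # Frame files are numbered from 1, as in the snapshot (1.npz ... 8.npz).
    return [video_dir / f"{i}.npz" for i in range(1, num_frames + 1)]


def cc3m_feature_path(root, split, shard, image_name):
    """Build the region-feature path for one CC3M image."""
    # The "_1" suffix mirrors the snapshot; treating it as fixed is an assumption.
    return Path(root) / "region_features_all" / split / shard / f"{image_name}_1.npz"


def missing_features(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]


if __name__ == "__main__":
    paths = webvid_feature_paths("WebVid", "train", "000001_000050", 1066674784)
    print(f"{len(missing_features(paths))} of {len(paths)} feature files are missing")
```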

## Pre-train
In the config file, set `data_dir` to the directory that contains the raw videos and `object_dir` to the directory that contains the region features.  
```
python -m torch.distributed.launch --nproc_per_node 8 --master_port 2912 train_dist_multi.py --config configs/pt/o2t-cl-local-select-loss-cc.json -sc 30 40
```
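
Instead of editing the JSON by hand, `data_dir` and `object_dir` can be set with a small script. The sketch below does not assume where these keys sit inside the config: it overwrites every occurrence of the given keys anywhere in the JSON tree. The helper names, the output file name, and the paths in the usage example are placeholders.

```python
import json


def set_keys(node, updates):
    """Recursively overwrite every occurrence of the given keys in a JSON tree."""
    if isinstance(node, dict):
        for key in node:
            if key in updates:
                node[key] = updates[key]
            else:
                set_keys(node[key], updates)
    elif isinstance(node, list):
        for item in node:
            set_keys(item, updates)
    return node


def update_config(src, dst, updates):
    """Load a JSON config, apply the updates, and write the result to dst."""
    with open(src, encoding="utf-8") as f:
        config = json.load(f)
    set_keys(config, updates)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)


if __name__ == "__main__":
    import os

    src = "configs/pt/o2t-cl-local-select-loss-cc.json"
    if os.path.exists(src):
        # Placeholder output name and paths; replace them with your own.
        update_config(src, "configs/pt/my_config.json", {
            "data_dir": "/path/to/raw_videos",
            "object_dir": "/path/to/region_features",
        })
```

Because every occurrence is overwritten with the same value, a config that lists several datasets with different directories should be edited per entry instead.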

> Note: If you run distributed training on a cluster, make sure the environment variables (e.g., master_port, master_address, world_size, rank) are set correctly. In our experiments, we usually use 4 nodes (i.e., 32 GPUs) for pre-training.
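
A quick sanity check before launching a multi-node job can catch a missing variable early. The sketch below assumes the job relies on PyTorch's standard `env://` variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`); if your cluster launcher passes these values another way, adapt the names accordingly. The helper names are hypothetical.

```python
import os

# Variables read by PyTorch's env:// initialization method.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_dist_vars(env=None):
    """Return the required distributed-training variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def expected_world_size(num_nodes, gpus_per_node):
    """With one process per GPU, the world size is nodes times GPUs per node."""
    return num_nodes * gpus_per_node


if __name__ == "__main__":
    missing = missing_dist_vars()
    if missing:
        print("unset variables:", ", ".join(missing))
    # The 4-node, 8-GPU setup mentioned above gives 32 processes.
    print("expected world size:", expected_world_size(4, 8))
```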

## Downstream tasks
In the config file, set `data_dir` to the directory that contains the raw videos and `object_dir` to the directory that contains the region features.  
Set `load_checkpoint` to the path of the pre-trained checkpoint file.  
- MSRVTT Retrieval
```
python -m torch.distributed.launch --nproc_per_node 2 --master_port 2912 train_dist_multi.py --config configs/ft/msrvtt_o2t-select.json -sc 2 4 8
```

[1]: https://arxiv.org/abs/2203.07720
[2]: https://huggingface.co/distilbert-base-uncased/tree/main
[3]: https://github.com/m-bain/frozen-in-time
[4]: https://github.com/kuanghuei/SCAN
[5]: https://github.com/CrossmodalGroup/BFAN
[6]: https://github.com/MILVLG/bottom-up-attention.pytorch
[7]: https://mega.nz/file/Yi4igDib#e8M5mwFEYXkGMv9nye9aoDoYr2neTwKx_DZyy6f1qyQ
[8]: https://mega.nz/file/ZiQSkRJb#CmoyCQKePbynJh1uKQT5I0iQ91ZTkQzJO4ecYlAyDuE
[9]: https://drive.google.com/file/d/11wdvsTYIPcSTRMVry1tufILiNE4aAMp5/view?usp=sharing