## VoLTA

This repo contains example pre-training code for the paper "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment".


### Setting up Conda Environment
Set up a conda environment and install the dependencies listed in the provided `requirements.txt` file.
```
conda create -n ssl_env
conda activate ssl_env
conda install pip
pip install -r requirements.txt
```
### Download Pre-training Dataset
- **COCO2014**: Download the [2014 train images](http://images.cocodataset.org/zips/train2014.zip), the [2014 val images](http://images.cocodataset.org/zips/val2014.zip), and the [Karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip). After extraction, `data_root` should be organized as follows:
```
data_root
    ├── train2014
    │   ├── COCO_train2014_000000250351.jpg
    │   ├── COCO_train2014_000000250352.jpg
    │   └── ...
    ├── val2014
    │   ├── COCO_val2014_000000165984.jpg
    │   ├── COCO_val2014_000000166003.jpg
    │   └── ...
    └── dataset_coco.json
```
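
To sanity-check the layout above, the Karpathy annotations can be paired with the image files. The following is a minimal sketch, not the loader used by `main.py`. The function name `load_karpathy_pairs` is hypothetical, and the JSON field names (`images`, `filepath`, `filename`, `split`, `sentences`, `raw`) are assumptions based on the commonly distributed `dataset_coco.json` format.

```python
import json
from pathlib import Path


def load_karpathy_pairs(data_root, split="train"):
    """Return (image_path, caption) pairs for one split of dataset_coco.json.

    Assumes the commonly distributed Karpathy JSON layout; the field names
    used below are assumptions, not taken from this repo.
    """
    data_root = Path(data_root)
    with open(data_root / "dataset_coco.json", encoding="utf-8") as f:
        annotations = json.load(f)

    pairs = []
    for image in annotations["images"]:
        if image["split"] != split:
            continue
        # "filepath" is assumed to name the sub-folder (train2014 or val2014).
        image_path = data_root / image["filepath"] / image["filename"]
        # Each image carries several captions; emit one pair per caption.
        for sentence in image["sentences"]:
            pairs.append((image_path, sentence["raw"]))
    return pairs
```

Calling `load_karpathy_pairs("<path_to_coco>", split="train")` and checking that the returned image paths exist is a quick way to confirm that the archives were extracted into the expected folders.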

- **COCO2017**: Our pre-training dataset is the [mscoco2017](https://academictorrents.com/details/74dec1dd21ae4994dfd9069f9cb0443eb960c962) train split, which contains ~120K image-caption pairs. We download it using [img2dataset](https://github.com/ShramanPramanick/img2dataset).
```
pip install img2dataset
wget https://huggingface.co/datasets/ChristophSchuhmann/MS_COCO_2017_URL_TEXT/resolve/main/mscoco.parquet
img2dataset --url_list mscoco.parquet --input_format "parquet" \
            --url_col "URL" --caption_col "TEXT" \
            --output_folder mscoco --processes_count 16 --thread_count 64 --image_size 384 \
            --enable_wandb True
``` 
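
After the download finishes, it can be useful to count how many image-caption pairs were actually saved, since some URLs may fail. The sketch below is one way to do that, assuming img2dataset wrote its default `files` output, where each `<key>.jpg` sits next to a `<key>.txt` caption inside numbered sub-folders. The function name `iter_image_caption_pairs` is hypothetical and is not part of img2dataset or this repo.

```python
from pathlib import Path


def iter_image_caption_pairs(output_folder):
    """Yield (image_path, caption) for every image that has a caption file.

    Assumes each image <key>.jpg has its caption in a sibling <key>.txt file.
    """
    for image_path in sorted(Path(output_folder).rglob("*.jpg")):
        caption_path = image_path.with_suffix(".txt")
        # Skip images whose caption file is missing.
        if caption_path.exists():
            yield image_path, caption_path.read_text(encoding="utf-8").strip()
```

For example, `sum(1 for _ in iter_image_caption_pairs("mscoco"))` gives the number of usable pairs in the `mscoco` output folder.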

### Pre-training
Pre-training takes approximately 100 hours on 32 A100 GPUs.

```
cd VoLTA
python main.py --batch_size 1024 --data_root_coco <path_to_coco> --epochs 50 --maxlen 30 --print_freq 100
```
