## Prerequisites

First, create a conda environment with the required packages:

    conda env create -f env.yml

## Reproducing the Results of Our Paper

First, download the MS-COCO and Flickr30k data and store them in ```data/coco``` and ```data/flickr30k```, respectively.
You can download the MS-COCO data with:

    cd data
    mkdir mscoco && cd mscoco
    wget http://images.cocodataset.org/zips/train2014.zip
    unzip train2014.zip
    wget http://images.cocodataset.org/zips/val2014.zip
    unzip val2014.zip
    cd ../..
    
Also, apply for access to the [Flickr30k dataset](https://shannon.cs.illinois.edu/DenotationGraph/) and save the images to ```./datasets/flickr30k```.
You will also need to download the train/val/test annotations for both datasets from [here](https://cs.stanford.edu/people/karpathy/deepimagesent/) and save them to the ```annotations``` directory.
Then parse both datasets:

    python data_prep/parse_coco.py
    python data_prep/parse_flickr30k.py
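
If the parsing scripts fail with missing-file errors, a quick check of the directory layout may help. The snippet below is a minimal sketch and not part of this repository: the helper `missing_paths` is hypothetical, and the directory names are the ones mentioned in this README, so adjust them if your layout differs.

```python
from pathlib import Path


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


if __name__ == "__main__":
    # Directory names as mentioned in this README (assumption: adjust as needed).
    expected = ["data/coco", "data/flickr30k", "annotations"]
    for path in missing_paths(expected):
        print(f"missing: {path}")
```

An empty output means every listed directory was found relative to the current working directory.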

To compute the different mappings, you first need to extract the CLIP language embeddings for the captions:

    python data_prep/prepare_embeddings.py --datadir datasets/mscoco --data mscoco --vis-encoder RN50x64

This will run for a while. It extracts caption embeddings for all CLIP backbones and saves them to ```data/```.
Before computing the mapping, run:

    python -m spacy download en_core_web_sm
   
This downloads and installs the English spaCy pipeline used for stop-word removal.
Then execute:

    python align_captions.py --dataset mscoco --vis-encoder RN50x64 
    
The ```--dataset``` argument can be set to either ```mscoco``` or ```flickr30k```.
Finally, you can generate captions for the MS-COCO test split via:

    python generate_captions.py --k 18 --mscoco --vis-encoder RN50x64 --train-method linear_reg --decoding greedy    

To generate captions for the Flickr30k dataset, set ```--datadir data/flickr30k/imgs_test.pkl``` and pass ```--flickr30k```.
The hyperparameter ```--k``` sets the number of captions provided in the prompt.
The ```--decoding``` option currently supports ```greedy```, ```sampling```, ```nucleus```, and ```topk```.
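
For readers unfamiliar with the four decoding options, the following is a minimal, self-contained sketch of how each one typically picks the next token from a vector of logits. It is an illustration of the standard techniques only and is not the code used in ```generate_captions.py```; the function `pick_token` and the parameters `top_k` and `top_p` are hypothetical names chosen for this example.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, method="greedy", top_k=5, top_p=0.9, rng=random):
    """Pick a next-token index from `logits` (illustrative, hypothetical API)."""
    probs = softmax(logits)
    # Token indices sorted from most to least likely.
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)

    if method == "greedy":
        # Always take the single most likely token.
        return ranked[0][0]
    if method == "sampling":
        # Sample from the full distribution.
        candidates = ranked
    elif method == "topk":
        # Sample only among the top_k most likely tokens.
        candidates = ranked[:top_k]
    elif method == "nucleus":
        # Sample among the smallest set of tokens whose mass reaches top_p.
        candidates, cumulative = [], 0.0
        for idx, p in ranked:
            candidates.append((idx, p))
            cumulative += p
            if cumulative >= top_p:
                break
    else:
        raise ValueError(f"unknown decoding method: {method}")

    indices = [i for i, _ in candidates]
    weights = [p for _, p in candidates]  # choices() renormalizes the weights
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [1.0, 3.0, 2.0]
    for name in ("greedy", "sampling", "nucleus", "topk"):
        print(name, pick_token(toy_logits, method=name, top_k=2))
```

Greedy decoding is deterministic, while the other three are stochastic, so repeated runs with those settings can produce different captions for the same image.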


## Generated Captions

The generated captions are stored as a JSON file in a new directory named ```results```.
Our metrics (BLEU, CIDEr-D, ROUGE-L, SPICE) are computed using the code from [here](https://github.com/tylin/coco-caption).
The annotation files required for computing these scores can be found in the ```annotations/``` directory.