# Autoregressive order inference

## Installation

To install this package, first download the package from github, then install it using pip.

```bash
git clone git@github.com:brandontrabucco/indigo.git
pip install -e indigo
```

You must then install helper packages for word tokenization and part of speech tagging. Enter the following statements into the python interpreter where you have installed our package.

```python
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('universal_tagset')
```

Finally, you must install the natural language evaluation package that contains several helpful metrics.

```bash
pip install git+https://github.com/Maluuba/nlg-eval.git@master
nlg-eval --setup
```

You can now start training a non-sequential model!

## Setup - Captioning

In this section, we will walk you through how to create a training dataset, using COCO 2017 as an example. In the first step, you must have downloaded COCO 2017. The annotations should be placed at `~/annotations` and the images should be placed at `~/train2017` and `~/val2017` for the training and validation set respectively.

Create a part of speech tagger first.

```bash
python scripts/data/create_tagger.py --out_tagger_file tagger.pkl
```

Extract COCO 2017 into a format compatible with our package. There are several arguments that you can specify to control how the dataset is processed. You may leave all arguments as default except `out_caption_folder` and `annotations_file`.

```bash
python scripts/data/extract_coco.py --out_caption_folder ~/captions_train2017 --annotations_file ~/annotations/captions_train2017.json
python scripts/data/extract_coco.py --out_caption_folder ~/captions_val2017 --annotations_file ~/annotations/captions_val2017.json
```

Process the COCO 2017 captions and extract integer features on which to train a non sequential model. There are again several arguments that you can specify to control how the captions are processed. You may leave all arguments as default except `out_feature_folder` and `in_folder`, which depend on where you extracted the COCO dataset in the previous step.

```bash
python scripts/data/process_captions.py --out_feature_folder ~/captions_train2017_features --in_folder ~/captions_train2017 --tagger_file tagger.pkl --vocab_file train2017_vocab.txt --min_word_frequency 5 --max_length 100
python scripts/data/process_captions.py --out_feature_folder ~/captions_val2017_features --in_folder ~/captions_val2017 --tagger_file tagger.pkl --vocab_file train2017_vocab.txt --max_length 100
```

Process images from the COCO 2017 dataset and extract features using a Faster RCNN FPN backbone. Note this script will distribute inference across all visible GPUs on your system. There are several arguments you can specify, which you may leave as default except `out_feature_folder` and `in_folder`, which depend on where you extracted the COCO dataset.

```bash
python scripts/data/process_images.py --out_feature_folder ~/train2017_features --in_folder ~/train2017 --batch_size 4
python scripts/data/process_images.py --out_feature_folder ~/val2017_features --in_folder ~/val2017 --batch_size 4
```

Finally, convert the processed features into a TFRecord format for efficient training. Record where you have extracted the COCO dataset in the previous steps and specify `out_tfrecord_folder`, `caption_folder` and `image_folder` at the minimum.

```bash
python scripts/data/create_tfrecords.py --out_tfrecord_folder ~/train2017_tfrecords --caption_folder ~/captions_train2017_features --image_folder ~/train2017_features --samples_per_shard 4096
python scripts/data/create_tfrecords.py --out_tfrecord_folder ~/val2017_tfrecords --caption_folder ~/captions_val2017_features --image_folder ~/val2017_features --samples_per_shard 4096
```

The dataset has been created, and you can start training.

## Setup - WMT

Here, we use WMT16 Ro-En as an example.

```bash
CUDA_VISIBLE_DEVICES=0 python extract_wmt.py --language_pair 16 ro en --data_dir {dataroot}
cd {dataroot}/wmt16_translate/ro-en
subword-nmt learn-joint-bpe-and-vocab --input src_raw_train.txt tgt_raw_train.txt -s 32000 -o joint_bpe.code --write-vocabulary src_vocab.txt tgt_vocab.txt
subword-nmt apply-bpe -c joint_bpe.code --vocabulary src_vocab.txt --vocabulary-threshold 50 < src_raw_train.txt > src_train.BPE.txt
subword-nmt apply-bpe -c joint_bpe.code --vocabulary src_vocab.txt --vocabulary-threshold 50 < src_raw_validation.txt > src_validation.BPE.txt
subword-nmt apply-bpe -c joint_bpe.code --vocabulary tgt_vocab.txt --vocabulary-threshold 50 < tgt_raw_train.txt > tgt_train.BPE.txt
subword-nmt apply-bpe -c joint_bpe.code --vocabulary tgt_vocab.txt --vocabulary-threshold 50 < tgt_raw_validation.txt > tgt_validation.BPE.txt
cd {folder_with_INDIGO_repo}
CUDA_VISIBLE_DEVICES=0 python scripts/data/process_wmt.py --out_feature_folder {dataroot}/wmt16_translate/ro-en --data_folder {dataroot}/wmt16_translate/ro-en --vocab_file {dataroot}/wmt16_translate/ro_en_vocab.txt --dataset_type train
CUDA_VISIBLE_DEVICES=0 python scripts/data/process_wmt.py --out_feature_folder {dataroot}/wmt16_translate/ro-en --data_folder {dataroot}/wmt16_translate/ro-en --vocab_file {dataroot}/wmt16_translate/ro_en_vocab.txt --dataset_type validation
CUDA_VISIBLE_DEVICES=0 python scripts/data/create_tfrecords_wmt.py --out_tfrecord_folder {dataroot}/wmt16_translate/ro-en --feature_folder {dataroot}/wmt16_translate/ro-en --samples_per_shard 4096 --dataset_type train
CUDA_VISIBLE_DEVICES=0 python scripts/data/create_tfrecords_wmt.py --out_tfrecord_folder {dataroot}/wmt16_translate/ro-en --feature_folder {dataroot}/wmt16_translate/ro-en --samples_per_shard 4096 --dataset_type validation
```

## Training

You may train several kinds of models using our framework. For example, you can replicate our results and train a non-sequential soft-autoregressive Transformer-InDIGO model using the following command in the terminal.

```bash
python scripts/train.py --train_folder ~/train2017_tfrecords --validate_folder ~/val2017_tfrecords --batch_size 32 --beam_size 1 --vocab_file train2017_vocab.txt --num_epochs 10 --model_ckpt ckpt/nsds --embedding_size 256 --heads 4 --num_layers 2 --first_layer region --final_layer indigo --order soft --iterations 10000
```

## Validation

You may evaluate a trained model with the following command. If you are not able to install the `nlg-eval` package, the following command will still run and print captions for the validation set, but it will not calculate evaluation metrics.

```bash
python scripts/validate.py --validate_folder ~/val2017_tfrecords --ref_folder ~/captions_val2017 --batch_size 32 --beam_size 1 --vocab_file train2017_vocab.txt --model_ckpt ckpt/nsds --embedding_size 256 --heads 4 --num_layers 2 --first_layer region --final_layer indigo
```
