# Collaborative heads: NMT experiments

To reproduce our experiments, clone the fairseq repository, check out the commit we used, and apply the code patch.

```bash
git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout 8dde7de8a22d6e59c4101fe0de618f888c33ba81
git apply collaborative_heads.patch
pip install --editable ./
# on macOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
```

Download and preprocess the data following these [instructions](https://github.com/pytorch/fairseq/tree/master/examples/scaling_nmt).

Then train the model with the following command:

```bash
# set COLAB to "none" to run the original transformer
# (variables are set on their own lines so that $KEY_DIM and $COLAB expand below)
KEY_DIM=512
COLAB="encoder_cross_decoder"
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data-bin/wmt16_en_de_bpe32k \
    --arch transformer_wmt_en_de \
    --save-dir checkpoints/wmt16-en-de/base-d-$KEY_DIM-colab-$COLAB \
    --share-all-embeddings \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr 0.0007 \
    --min-lr 1e-09 \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 \
    --warmup-init-lr 1e-07 \
    --dropout 0.1 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 2 \
    --fp16 \
    --collaborative-heads $COLAB \
    --key-dim $KEY_DIM
```
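
If you have a different number of GPUs, you will probably want to keep the number of tokens per optimizer step close to ours. In fairseq, that number is bounded by roughly `--max-tokens` × number of GPUs × `--update-freq`. The sketch below does this arithmetic for the command above; `NUM_GPUS`, `MY_GPUS` and `MY_UPDATE_FREQ` are hypothetical helper variables for illustration, not fairseq options.

```bash
# Values taken from the training command above.
MAX_TOKENS=3584
UPDATE_FREQ=2
NUM_GPUS=4   # number of devices listed in CUDA_VISIBLE_DEVICES

# Approximate upper bound on tokens per optimizer step.
TOKENS_PER_UPDATE=$((MAX_TOKENS * NUM_GPUS * UPDATE_FREQ))
echo "tokens per update: $TOKENS_PER_UPDATE"

# Hypothetical setup with fewer GPUs: scale --update-freq to compensate.
# Integer division truncates, so choose MY_GPUS that divides evenly.
MY_GPUS=1
MY_UPDATE_FREQ=$((UPDATE_FREQ * NUM_GPUS / MY_GPUS))
echo "--update-freq for $MY_GPUS GPU(s): $MY_UPDATE_FREQ"
```

With a single GPU this suggests `--update-freq 8`, which trades wall-clock time for the same approximate batch size.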
