This is the codebase for our paper "Tailoring Language Generation Models under Total Variation Distance". Most of the experiments are run with fairseq command-line tools, and our method is implemented in `fairseq/criterions/tailor.py`.

## Synthetic Experiments

To replicate the synthetic experiment, first go to `examples/synthetic`.

The COCO dataset can be downloaded from [here](https://github.com/CR-Gjx/LeakGAN/tree/master/Image%20COCO/save). 

### Preprocess

1. Use the vocabulary file `vocab_cotra.pkl` (load it with pickle) to replace the token IDs in `realtrain_cotra.txt` and `realtest_coco.txt` with the corresponding tokens.

2. Split off the first 5000 lines of `realtrain_cotra.txt` as the dev set. Rename the three files to `train.tgt`, `test.tgt` and `dev.tgt`, and move them to a new directory `data/coco`.

3. Create three new files containing `<go>` on each line and name them `train.src`, `test.src` and `dev.src`. Each source file must have the same number of lines as the corresponding target file.

4. Preprocess the data by running: `bash binarize.sh data/coco`

	It should generate processed data in `data/coco-bin`.
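
Steps 1 to 3 above can be sketched in Python as follows. This is a minimal sketch rather than a script shipped with the repository: the helper names are hypothetical, `id_to_token` is assumed to be a list or dict mapping integer IDs to tokens (inspect the object returned by `pickle.load` on `vocab_cotra.pkl` to find it), and the training set is assumed to be whatever remains after the dev lines are removed.

```python
from pathlib import Path


def decode_lines(lines, id_to_token):
    """Replace each whitespace-separated token ID with its token."""
    return [" ".join(id_to_token[int(tok)] for tok in line.split()) for line in lines]


def split_dev(lines, n_dev=5000):
    """Return (dev, train): the first n_dev lines form the dev set.

    Assumption: the remaining lines are used as the training set.
    """
    return lines[:n_dev], lines[n_dev:]


def source_lines(n, token="<go>"):
    """Build n source lines, one `<go>` per target line."""
    return [token] * n


def write_split(out_dir, name, targets):
    """Write <name>.tgt and a matching <name>.src with the same line count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    tgt_text = "\n".join(targets) + "\n"
    src_text = "\n".join(source_lines(len(targets))) + "\n"
    (out_dir / f"{name}.tgt").write_text(tgt_text, encoding="utf-8")
    (out_dir / f"{name}.src").write_text(src_text, encoding="utf-8")
```

For example, after decoding `realtrain_cotra.txt` into a list `decoded`, `dev, train = split_dev(decoded)` followed by `write_split("data/coco", "dev", dev)` and `write_split("data/coco", "train", train)` produces the files expected by `binarize.sh`.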

### Generate Synthetic Data

1. Train the oracle model by running `bash train_oracle.sh`. The model checkpoint is saved in `models/coco-mle-4096-lr1e-3-ep50`.

2. To sample synthetic data from the oracle model, first create the source file as described in step 3 of Preprocess, e.g., `data/coco/train.src`.

3. Then run: `bash generate.sh models/coco-mle-4096-lr1e-3-ep50 _best train.src 0 data/coco`

	The generated result will be saved as `models/coco-mle-4096-lr1e-3-ep50/train.gen._best`. 

Create a new directory `data/coco_pseudo` to save the synthetic data generated by the oracle model. In our paper, we sampled 10K examples for training and 5K for evaluation.

### Train

To train the MLE model, run: `bash train_mle.sh`

To train the TaiLr model, run: `bash train_tv.sh`

### Evaluation

**PPL_oracle**

1. First generate samples from the trained model (in the paper we generate 20K samples) and preprocess them following the previous steps, saving them as the `test` split. Create directories to hold the result, e.g., `data/coco_mle` and `data/coco_mle-bin`.

2. To evaluate the samples with the oracle model, run: `bash eval_lm.sh models/coco-mle-4096-lr1e-3-ep50 50 test 0 data/coco_mle`

**PPL_test**

1. In the paper, we sampled 20K samples from the oracle model for evaluation. The procedure is the same as for PPL_oracle, except that the model being evaluated is the trained model instead of the oracle model.
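
Both metrics are perplexities and differ only in which model scores which samples: PPL_oracle scores samples from the trained model under the oracle, while PPL_test scores samples from the oracle under the trained model. The quantity itself can be sketched as below. This is only an illustration, not the implementation behind `eval_lm.sh`; the function name is hypothetical, and the log-probabilities are assumed to be natural logarithms.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

If a tool reports log-probabilities in base 2, the same formula applies with `2 **` in place of `math.exp`.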

**BLEU-4**

Run command: `python bleu.py --s <GEN_FILE> --r <GT_FILE>`

**SelfBLEU-4**

Run command: `python selfbleu.py --s <GEN_FILE>`
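
For intuition, Self-BLEU treats each generated sentence as a hypothesis and all other generated sentences as its references, then averages the BLEU scores; a lower value indicates more diverse samples. The sketch below is a pure-Python illustration of that idea and is not the code in `bleu.py` or `selfbleu.py`, whose tokenization and smoothing may differ. The function names and the `eps` floor for zero precisions are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu4(hypothesis, references, eps=1e-9):
    """BLEU-4 with uniform weights, clipped precisions and a brevity penalty.

    Zero precisions are floored at eps (an assumed, very simple smoothing).
    """
    log_precisions = []
    for n in range(1, 5):
        hyp_counts = ngrams(hypothesis, n)
        # Maximum count of each n-gram over all references, used for clipping.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        precision = clipped / total if total else 0.0
        log_precisions.append(math.log(max(precision, eps)))
    # Brevity penalty against the reference length closest to the hypothesis.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references), key=lambda l: (abs(l - hyp_len), l))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(log_precisions) / 4)


def self_bleu4(sentences):
    """Average BLEU-4 of each sentence against all the other sentences."""
    if len(sentences) < 2:
        raise ValueError("Self-BLEU needs at least two sentences")
    scores = [
        sentence_bleu4(hyp, sentences[:i] + sentences[i + 1:])
        for i, hyp in enumerate(sentences)
    ]
    return sum(scores) / len(scores)
```

Here each sentence is a list of tokens, e.g. `self_bleu4([line.split() for line in open(path)])` for a hypothetical file `path` with one sample per line.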

## Machine Translation

To replicate the machine translation experiment, first go to `examples/translation`.

Preprocess the data following https://github.com/facebookresearch/fairseq/tree/main/examples/translation

Train the model by running: `bash train_mle.sh` or `bash train_tailor.sh`

Generate samples by running: `bash generate.sh <CKPT_DIR> 0 test`

## Summarization

To replicate the summarization experiment, first go to `examples/bart`.

Download preprocessed data from https://huggingface.co/datasets/gigaword

Train the model by running: `bash train_giga.sh` or `bash train_giga_tailor.sh`

Generate samples by running: `bash generate.sh <CKPT_DIR> <EPOCH> <DATA_FILE> 0`

## Long Text Generation

To replicate the long text generation experiment, first go to `examples/bart`.

Download the data from https://dl.fbaipublicfiles.com/fairseq/data/writingPrompts.tar.gz

Train the model by running: `bash train_wp.sh` or `bash train_wp_tailor.sh`

Generate samples by running: `bash sample.sh <CKPT_DIR> <EPOCH> <DATA_FILE> 0`




