## Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

## Requirements

* Python libraries: See [requirements.txt](./requirements.txt) for exact library dependencies. You can use the following commands with Miniconda3 to create and activate your Python environment:
  - `conda create --name smoothie python=3.9`
  - `conda activate smoothie`
  - `conda install pip`
  - `pip install -r requirements.txt`
  - `python -m spacy download en`

## Dataset loading

For the Quasar-T dataset, first download the files `train.json`, `valid.json` and `test.json` from the [DiffuSeq GitHub repository](https://github.com/Shark-NLP/DiffuSeq/tree/main) and put them in the `./datasets/` folder.
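Before running the load command for Quasar-T, it can help to confirm that the three files are where the loader expects them. The snippet below is a minimal sketch and not part of this repository: the helper name `missing_quasar_t_files` is hypothetical, and it assumes the files sit directly inside `./datasets/`, as described above.

```python
from pathlib import Path

# File names taken from the instructions above.
QUASAR_T_FILES = ("train.json", "valid.json", "test.json")


def missing_quasar_t_files(datasets_dir="./datasets"):
    """Return the Quasar-T files that are not present in datasets_dir.

    Hypothetical helper: it only checks that the files exist,
    not that their contents are valid.
    """
    root = Path(datasets_dir)
    return [name for name in QUASAR_T_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_quasar_t_files()
    if missing:
        print("Missing Quasar-T files:", ", ".join(missing))
    else:
        print("All Quasar-T files found.")
```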

Then run the following command:
```
python -m data.load --dataset_name=dataset_name
```

For any other dataset used in the paper, you can run the command above without downloading anything.

Replace `dataset_name` with one of the following:
 - `'rocstories'`
 - `'qqp'`
 - `'xsum'`
 - `'paradetox'`
 - `'quasar_t'`
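To prepare several datasets in one go, the load command can be assembled programmatically. This is a sketch under stated assumptions, not a script shipped with the repository: `load_command` is a hypothetical helper, it only builds the argument list for the `python -m data.load` command shown above, and it assumes you run it from the repository root.

```python
import shlex
import sys

# Dataset names as listed above.
DATASET_NAMES = ("rocstories", "qqp", "xsum", "paradetox", "quasar_t")


def load_command(dataset_name):
    """Build the argument list for `python -m data.load` (hypothetical helper)."""
    if dataset_name not in DATASET_NAMES:
        raise ValueError(
            f"Unknown dataset {dataset_name!r}; expected one of {DATASET_NAMES}"
        )
    # sys.executable is the interpreter of the active environment.
    return [sys.executable, "-m", "data.load", f"--dataset_name={dataset_name}"]


if __name__ == "__main__":
    # Print the commands instead of executing them, so this is safe to run anywhere.
    for name in DATASET_NAMES:
        print(shlex.join(load_command(name)))
```

To actually execute a command, pass the list to `subprocess.run(load_command("qqp"), check=True)`.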


## Diffusion training

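To train the basic Smoothie setup, run the command below, replacing `n` with the number of processes per node and `dataset_name` with one of the dataset names listed above: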

```
torchrun --nproc_per_node=n train_diffusion.py --dataset_name dataset_name --smooth_diffusion
```

This script trains the Smoothie model used in the paper.
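As an illustration of how the placeholders in the training command are filled in, the sketch below builds the command for a concrete dataset and process count. The helper `train_command` is hypothetical and not part of the repository; it uses only the flags shown above and assumes one process per GPU, which is the usual way to set `--nproc_per_node`.

```python
import shlex


def train_command(dataset_name, nproc_per_node):
    """Build the torchrun argument list for Smoothie training (hypothetical helper)."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "train_diffusion.py",
        "--dataset_name",
        dataset_name,
        "--smooth_diffusion",
    ]


if __name__ == "__main__":
    # Example values only: 4 processes, QQP dataset.
    print(shlex.join(train_command("qqp", 4)))
```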
