# Injecting knowledge into language generation: a case study in auto-charting after-visit care instructions from medical dialogue

## Requirements

Dependencies are listed in `requirements.txt`:

```
pip install -r requirements.txt
```

## Environment variables

To use the provided scripts, set the following environment variables:

```bash
export EXPERIMENTS_DIRECTORY=
export DATA=
export DATA_DBA=
```

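Here `${DATA}` is the path to the pre-processed (binarized) training data, and `${DATA_DBA}` is the path to the binarized data for the DBA baseline. `${EXPERIMENTS_DIRECTORY}` is the directory under which checkpoints are looked up, for example `${EXPERIMENTS_DIRECTORY}/utilization_rate/sweep_step_1/checkpoint_best.pt` in the DBA command.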
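
As a convenience, the variables can be checked before launching any script. The following is a minimal sketch, not part of this repository; the helper name `missing_variables` is hypothetical, and only the three variable names come from the list above.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARIABLES = ("EXPERIMENTS_DIRECTORY", "DATA", "DATA_DBA")


def missing_variables(environ=os.environ, required=REQUIRED_VARIABLES):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; it is not shipped with the scripts.
    """
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

An empty value (as in the `export` template above) is treated the same as an unset variable.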

Data for the DBA baseline contains sources of the following form:

```
token1 token2 ... tokenT\tconcept1\tconcept2\tconcept3...\tconceptC
```

`\t` refers to the `TAB` delimiter, and each `concept` can be a compound medical concept. For the DBA baseline we use only concepts found in the source whose concept weight is at least 0.6.
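
The source format and the weight filter described above can be sketched as follows. This is an illustration under assumptions, not the code used to build the data: the function name `build_dba_source_line`, the `(concept, weight)` input pairs, and the example sentence are hypothetical, and how concepts and weights are extracted is not shown here.

```python
MIN_CONCEPT_WEIGHT = 0.6  # threshold stated above


def build_dba_source_line(tokens, weighted_concepts, min_weight=MIN_CONCEPT_WEIGHT):
    """Format one DBA source line: space-joined tokens, then TAB-separated concepts.

    `weighted_concepts` is assumed to be a list of (concept, weight) pairs
    for concepts found in the source; concepts below `min_weight` are dropped.
    """
    kept = [concept for concept, weight in weighted_concepts if weight >= min_weight]
    return "\t".join([" ".join(tokens)] + kept)


line = build_dba_source_line(
    ["take", "ibuprofen", "for", "lower", "back", "pain"],
    [("ibuprofen", 0.9), ("lower back pain", 0.75), ("pain", 0.4)],
)
print(repr(line))
# 'take ibuprofen for lower back pain\tibuprofen\tlower back pain'
```

Note that a compound concept such as `lower back pain` stays in a single TAB-separated field, and `pain` is dropped because its weight is below 0.6.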

## Data Preprocessing

We use the standard `fairseq-preprocess` command for data preprocessing:

```
fairseq-preprocess \
    --source-lang en --target-lang en \
    --trainpref ${TRAIN_BPE} \
    --validpref ${VALID_BPE} \
    --testpref ${TEST_BPE} \
    --destdir ${DATA} \
    --joined-dictionary \
    --workers 20
```

## Installing fairseq

Use fairseq at the following commit to avoid issues caused by later changes to the codebase. Clone the repository, check out the commit, and install fairseq in editable mode:

```
git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout c6006678261bf5d52e2c744508b5ddd306cafebd
pip install --editable ./
```

## Training and validating the models

After setting the environment variables, you can execute training with the following command:

```
python ${PYTHON_SWEEP_PATH} --call_fn prepare_configuration --sweep_step ${SWEEP_STEP} | xargs python ${FAIRSEQ_MODULE}/train.py
```

To run validation, use:

```
python ${PYTHON_SWEEP_PATH} --call_fn validate_trained_sweep_ontest --experiment_name_to_validate utilization_rate --sweep_step ${SWEEP_STEP} --beam ${BEAM} | xargs python ${FAIRSEQ_MODULE}/validate.py
```

Here `$PYTHON_SWEEP_PATH` is the path to the `sweep_utils/prepare_configuration.py` file, `$FAIRSEQ_MODULE` is the directory containing fairseq's `train.py` and `validate.py`, `$SWEEP_STEP` is the number of the setup (from 1 to 75), and `$BEAM` is the desired beam size for generation. `$SWEEP_STEP` = 1 corresponds to the baseline.
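
To locate the checkpoint produced by a given sweep step, something like the sketch below may help. It assumes that every step follows the layout used for the baseline in the DBA command of this README (`utilization_rate/sweep_step_1/checkpoint_best.pt`); that layout for other steps is an assumption, and the helper name `checkpoint_path` is hypothetical.

```python
FIRST_SWEEP_STEP, LAST_SWEEP_STEP = 1, 75  # range stated above


def checkpoint_path(experiments_directory, sweep_step, experiment_name="utilization_rate"):
    """Build the best-checkpoint path for a sweep step.

    The directory layout is assumed from the baseline path (sweep step 1)
    and may not hold for every experiment.
    """
    if not FIRST_SWEEP_STEP <= sweep_step <= LAST_SWEEP_STEP:
        raise ValueError(
            f"sweep_step must be between {FIRST_SWEEP_STEP} and {LAST_SWEEP_STEP}, got {sweep_step}"
        )
    return f"{experiments_directory}/{experiment_name}/sweep_step_{sweep_step}/checkpoint_best.pt"


print(checkpoint_path("/tmp/experiments", 1))
# /tmp/experiments/utilization_rate/sweep_step_1/checkpoint_best.pt
```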

## DBA experiments 

For DBA experiments we use `LexicallyConstrainedBeamSearch` in the following way:

```
fairseq-generate ${DATA_DBA} --path ${EXPERIMENTS_DIRECTORY}/utilization_rate/sweep_step_1/checkpoint_best.pt \
    --tokenizer moses --bpe fastbpe --max-len-a 1.2 --max-len-b 10 \
    --beam 100 --nbest 1 \
    --source-lang en --target-lang en \
    --remove-bpe --constraints unordered \
```