# `manipulator` in complex modalities

This page documents the scripts and Jupyter notebooks needed to reproduce the experiments in Section F.1, "`manipulator` in complex modalities", of [**A Pattern Language for Machine Learning Tasks**](.).

## Requirements

All the required Python libraries can be installed by running

```bash
pip install -r requirements_llm.txt
```

## Running the experiments

All experiments used the PyTorch implementation of pre-trained models.

### 1. Pretraining the B-GST model

For this step, we followed the [README](https://github.com/agaralabs/transformer-drg-style-transfer/blob/master/README.md) of [Transforming Delete, Retrieve, Generate Approach for Controlled Text Style Transfer](https://www.aclweb.org/anthology/D19-1322/). This step also prepares the data used in the subsequent steps of the experiments. We trained the model until its scores exceeded the baselines originally reported by the authors (`GLEU=11.6`, `BLEU_SRC=71.0`). We recommend cloning the [original repo](https://github.com/agaralabs/transformer-drg-style-transfer.git), because this repo includes only their model definitions in order to avoid unnecessary duplication.

### 2. Pretraining the `get` of `manipulator`

To prepare the `manipulator` setup, we first trained a HuggingFace [OpenAI GPT](https://api.semanticscholar.org/CorpusID:49313245) model for sentiment classification on the Yelp dataset, using the following command:

```bash
python scripts/pretrain_getter.py \
  --output_dir=/path/to/output/dir \
  --train_dataset=/path/to/train/dataset \
  --eval_dataset=/path/to/eval/dataset \
  --seed=42 \
  --num_train_epochs=1 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --max_seq_length=85
```

We trained `get` until it reached 98% accuracy on the test set. This model was kept frozen (not trained further) in the remaining steps of the experiment.

### 3. Fine-tuning the `put` of `manipulator`

#### 3.1 `B-GST`-only

This condition continues training the B-GST model from Step 1 above.

#### 3.2 `manipulator`-only

This training condition and the next one use the same script to fine-tune the B-GST model. The only difference is the `--rule_weights` argument: for `manipulator`-only, we don't set a value for the `"tdrg"` rule weight.

```bash
python scripts/finetune_transformer_dg.py \
  --pretrained_putter_path=/path/to/model/from/step/1 \
  --pretrained_getter_path=/path/to/model/from/step/2 \
  --output_dir=/path/to/output/dir \
  --train_dataset=/path/to/train/dataset \
  --eval_dataset=/path/to/eval/dataset \
  --rule_weights='{"get_put": "20", "put_get": "5", "undo": "20"}' \
  --seed=42 \
  --freeze_getter=True \
  --num_train_epochs=1 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --max_seq_length=85 \
  --log_level=20 \
  --logging_step=500
```
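The sketch below illustrates one way a `--rule_weights` string like the one above could be turned into a weighted training loss. It is not taken from `scripts/finetune_transformer_dg.py`: the helper names `parse_rule_weights` and `combine_losses` are hypothetical, and it assumes that a rule with no weight simply contributes nothing to the total.

```python
import json

# Rule names that appear in the --rule_weights examples on this page.
KNOWN_RULES = ("get_put", "put_get", "undo", "tdrg")


def parse_rule_weights(raw):
    """Parse a JSON string of rule weights into a dict of floats (hypothetical helper)."""
    weights = {name: float(value) for name, value in json.loads(raw).items()}
    unknown = sorted(set(weights) - set(KNOWN_RULES))
    if unknown:
        raise ValueError(f"unknown rule names: {unknown}")
    return weights


def combine_losses(losses, weights):
    """Weighted sum over the rules that have a weight.

    Assumption: rules without a weight (e.g. "tdrg" in the
    manipulator-only condition) are left out of the sum.
    """
    return sum(weights[name] * losses[name] for name in weights)


weights = parse_rule_weights('{"get_put": "20", "put_get": "5", "undo": "20"}')
per_rule_losses = {"get_put": 1.0, "put_get": 2.0, "undo": 0.5, "tdrg": 100.0}
total = combine_losses(per_rule_losses, weights)  # "tdrg" is ignored here
```

Because the weights are quoted strings in the JSON, the sketch converts them with `float` before use.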

#### 3.3 `B-GST+manipulator`

We used the following command to fine-tune with the combined conditions.

```bash
python scripts/finetune_transformer_dg.py \
  --pretrained_putter_path=/path/to/model/from/step/1 \
  --pretrained_getter_path=/path/to/model/from/step/2 \
  --output_dir=/path/to/output/dir \
  --train_dataset=/path/to/train/dataset \
  --eval_dataset=/path/to/eval/dataset \
  --rule_weights='{"get_put": "10", "put_get": "5", "undo": "25", "tdrg": "30.0"}' \
  --seed=42 \
  --freeze_getter=True \
  --num_train_epochs=1 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --max_seq_length=85 \
  --log_level=20 \
  --logging_step=500
```

For reproducibility, run the above two commands without modifying the `--rule_weights` and `--seed` values.

## Evaluation

### 1. Output sentence generation

For evaluation, we did sentiment transfer on the 500 positive and 500 negative sentences in the test set. This set of 1000 sentences has a corresponding human reference gold standard which we used to compute the GLEU and BLEU metrics.

To generate sentences, modify [OpenAI_GPT_Pred.ipynb](https://github.com/agaralabs/transformer-drg-style-transfer/blob/master/OpenAI_GPT_Pred.ipynb) to point to the correct dataset and model directories and/or files.

### 2. GLEU

As in the original paper, we used the implementation of GLEU from [GLEU Without Tuning](http://arxiv.org/abs/1605.02592).

```bash
git clone https://github.com/cnap/gec-ranking
cd gec-ranking
python scripts/compute_gleu \
  -s /path/to/source.test.txt \
  -r /path/to/ref.test.txt \
  -o /path/to/output.test.txt
```

The file `/path/to/source.test.txt` should contain the original input sentences to the sentiment transfer step, `/path/to/ref.test.txt` should contain the human references, and `/path/to/output.test.txt` should contain the sentences generated in the output sentence generation step (Evaluation, Step 1) above.
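Since the three files are compared sentence by sentence, it can help to confirm that they contain the same number of lines before computing any metric. The following is a minimal sketch of such a check; `load_sentences` and `check_aligned` are hypothetical helpers and are not part of this repo or of `gec-ranking`.

```python
from pathlib import Path


def load_sentences(path):
    """Read a text file with one sentence per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_aligned(**files):
    """Return the shared sentence count, or raise if the lists differ in length."""
    lengths = {name: len(sentences) for name, sentences in files.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"line counts differ: {lengths}")
    return next(iter(lengths.values()))
```

For example, `check_aligned(source=load_sentences("/path/to/source.test.txt"), reference=load_sentences("/path/to/ref.test.txt"), output=load_sentences("/path/to/output.test.txt"))` would be expected to return 1000 for the test set described above.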

### 3. BLEU

We computed two BLEU scores: one comparing the generated sentences with the source sentences (`BLEU_SRC`) and one comparing the generated sentences with the human references (`BLEU_REF`). We used the implementation of BLEU adapted in the original paper, which can be found in [`notebooks/finetuning/eval-yelp.ipynb`](./notebooks/finetuning/eval-yelp.ipynb).
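To make the difference between the two scores concrete, here is a minimal corpus-level BLEU sketch. It is not the adapted implementation from the notebook: it assumes whitespace tokenization, a single reference per sentence, n-grams up to length 4, no smoothing, and a 0-100 scale, so its numbers will not necessarily match the reported ones. The function names are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Under these assumptions, `BLEU_SRC` corresponds to `corpus_bleu(outputs, sources)` and `BLEU_REF` to `corpus_bleu(outputs, references)`, where the three lists hold the generated, source, and human reference sentences in the same order.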

### 4. Sentiment accuracy with FastText

We trained a separate sentiment classifier on the same training-dev-test split using [FastText](https://fasttext.cc). The classifier we trained achieved 98% accuracy on the test set. This was then used to compute the sentiment accuracy reported in the paper.
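The accuracy computation itself can be sketched as follows, assuming that sentiment accuracy means the fraction of generated sentences that the classifier assigns to the intended target sentiment. The `classify` argument stands in for the trained FastText model; `toy_classifier` is a hypothetical keyword-based placeholder used only to keep the example self-contained.

```python
def sentiment_accuracy(generated, target_labels, classify):
    """Fraction of generated sentences classified as their target sentiment."""
    if len(generated) != len(target_labels):
        raise ValueError("generated sentences and target labels differ in length")
    if not generated:
        raise ValueError("no sentences to evaluate")
    hits = sum(1 for sentence, target in zip(generated, target_labels)
               if classify(sentence) == target)
    return hits / len(generated)


def toy_classifier(sentence):
    """Placeholder for the real classifier (assumption, not FastText)."""
    return "positive" if "great" in sentence.split() else "negative"


accuracy = sentiment_accuracy(
    ["the food was great", "the food was bad", "service was great"],
    ["positive", "negative", "negative"],
    toy_classifier,
)
```

In the actual evaluation, the target label of each sentence is the opposite of the sentiment of its source sentence.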

Excluding the GLEU score, all the metrics reported in our paper can be computed using the [`notebooks/finetuning/eval-yelp.ipynb`](./notebooks/finetuning/eval-yelp.ipynb) notebook.


## Datasets

The version of the Yelp dataset that we used can be downloaded from https://github.com/lijuncen/Sentiment-and-Style-Transfer/tree/master/data.
