# Introduction

This repository contains the code for the paper *Bag-of-Vectors Autoencoders for Unsupervised Conditional Text Generation* by anonymous authors.

The paper extends the Emb2Emb framework to work with variable-size embeddings.
The framework consists of three stages, as shown below. In the pretraining stage, a text autoencoder is trained on unlabeled data.
In the task training stage, a mapping is trained that maps the embedding of the input to the embedding of the output.
In the inference stage, the encoder, mapping, and decoder are combined to solve the task.

![Emb2Emb Training Framework](images/pnpframework.png)
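
As a rough illustration of how the three stages fit together at inference time, here is a toy sketch. The functions `encode`, `mapping`, `decode`, and `infer` are hypothetical stand-ins and not part of this repository's API: in the real system the encoder and decoder come from the pretrained autoencoder and the mapping is a trained network.

```python
from typing import List

Vector = List[float]
Bag = List[Vector]  # variable-size embedding: one vector per token


def encode(text: str) -> Bag:
    """Toy encoder (assumption): one 2-d vector per whitespace token."""
    return [[float(len(tok)), float(i)] for i, tok in enumerate(text.split())]


def mapping(bag: Bag) -> Bag:
    """Toy mapping (assumption): keep every other vector.

    The output bag may have a different size than the input bag.
    """
    return bag[::2]


def decode(bag: Bag) -> str:
    """Toy decoder (assumption): emit one placeholder token per vector."""
    return " ".join(f"tok{int(vec[1])}" for vec in bag)


def infer(text: str) -> str:
    """Inference stage: encoder, mapping, and decoder applied in sequence."""
    return decode(mapping(encode(text)))


print(infer("a short example sentence"))
```

The point of the sketch is only the composition: the mapping operates entirely in embedding space, and the size of the bag it returns need not match the size of the bag it receives.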

# Requirements

The experiments were run under Debian 10, Python 3.8.8, and PyTorch 1.8.

The required Python packages can be installed via
```bash
pip install -r requirements_iclr.txt
```

# Code Structure

The code consists of the `autoencoders` package, which handles the pretraining of an
RNN autoencoder, and the `emb2emb` package, which handles the task training and inference
stages.

The main training files are the following:

* autoencoders/trainer.py: Handles the autoencoder pretraining
* emb2emb/train.py: Handles Emb2Emb training, as well as training of the classifier used by Emb2Emb.

## emb2emb

Implements task training and inference stages.

* trainer.py: Implements the workflow of Emb2Emb.
* architectures.py: Implements the mapping Phi, in particular the BovToBovMapping class, which has several parameters that turn on the Transformer++ variant.
* losses.py: Defines the losses from the paper, in particular the LocalBagLoss, which allows backpropagating at every timestep or within a window of a given size.
* hausdorff.py: Handles computation of the (differentiable) Hausdorff distance.
* classifier.py: Trains binary classifiers on top of the embedding to be used for style loss
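
To give some intuition for a Hausdorff-style distance between two bags of vectors, here is a minimal pure-Python sketch. It is not the code in hausdorff.py: the function names, the Euclidean point distance, and the use of a mean instead of a maximum over points are all assumptions made for illustration, and the repository's implementation may differ in each of these choices.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def directed_mean_min(xs: List[Vector], ys: List[Vector]) -> float:
    """For each x in xs, take the distance to its nearest y; average over xs."""
    if not xs or not ys:
        raise ValueError("both bags must be non-empty")
    return sum(min(euclidean(x, y) for y in ys) for x in xs) / len(xs)


def average_hausdorff(xs: List[Vector], ys: List[Vector]) -> float:
    """Symmetric distance: sum of both directed terms.

    Assumption: averaging (rather than taking the max over points) is used
    here so that every vector contributes to the value.
    """
    return directed_mean_min(xs, ys) + directed_mean_min(ys, xs)


# Bags of different sizes can be compared directly.
bag_a = [[0.0, 0.0], [1.0, 0.0]]
bag_b = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
print(average_hausdorff(bag_a, bag_b))  # 1.0
```

In the example, every vector in `bag_a` has an exact match in `bag_b` (directed term 0), while the extra vector `[4.0, 0.0]` in `bag_b` is 3 away from its nearest neighbor, giving a directed term of 3 / 3 = 1.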

## autoencoders
Implements autoencoder pretraining.

* base_encoder.py and base_ar_decoder.py: Implement the basics of an autoregressive encoder-decoder autoencoder.
* transformer_encoder.py: Implements a Transformer encoder.
* transformer_decoder.py: Implements a Transformer decoder.
* autoencoder.py: Besides encoding and decoding, manages regularization methods like adding noise to the input or minimizing the L0Drop loss.
* data_loaders.py: Preprocesses text data into HDF5 files for quicker training afterwards.
* noise.py: Computes noise for denoising autoencoders.
* l0drop.py: A layer for performing L0Drop.

# Tutorial

There are two steps: We first need to pretrain the autoencoder on the text data without labels, and then
train Emb2Emb on the data with labels.

## Pretrain autoencoder

### Prepare data for autoencoder training
First, you need to preprocess your data and bring it into the HDF5 file format expected by the autoencoder training script.
To this end, concatenate s1.train and s2.train into all.train. The resulting file contains all texts available at training time, one sentence per line.
Process it via the following command:

```bash
cd autoencoders
python data_loaders.py <my-dataset>/all.train <my-dataset>/<dataset>.h5 64 -t CharBPETokenizer -mw 30000
```
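
The concatenation step described above can be done with any tool. As one option, here is a small Python sketch. The function name `concatenate` is hypothetical, and dropping empty lines is an assumption about what "one sentence per line" should mean for the output file.

```python
from typing import Iterable


def concatenate(sources: Iterable[str], target: str) -> int:
    """Write all non-empty lines of the source files into `target`.

    Each sentence ends with exactly one newline, even if a source file
    lacks a trailing newline. Returns the number of sentences written.
    """
    count = 0
    with open(target, "w", encoding="utf-8") as out:
        for src in sources:
            with open(src, encoding="utf-8") as f:
                for line in f:
                    sentence = line.rstrip("\n")
                    if sentence.strip():  # assumption: skip empty lines
                        out.write(sentence + "\n")
                        count += 1
    return count
```

For example, `concatenate(["<my-dataset>/s1.train", "<my-dataset>/s2.train"], "<my-dataset>/all.train")` produces the file expected by the command above.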

### Build config file
Autoencoder training is configured through a config file, for which autoencoders/config/default.json is a good template.

### Train
After configuring the config file, you can train the autoencoder with the following command:
```bash
python trainer.py iclr_experiments/<my-dataset>/<my-file>.json
```

### ICLR experiments
We added the config files for autoencoders presented in our study in the config/iclr_configs/ folder, which can be used to reproduce the results.

## Train the Emb2Emb mapping

First, you need to train a style classifier / length regressor. For training on Gigaword, the command looks like this:

```bash
cd ../emb2emb/
python train.py --embedding_dim 128 --batch_size 64 --lr 0.00005 --modeldir <path-to-modeldir> --data_fraction 1.0 --test_data_fraction 1.0 --n_epochs 0 --n_layers 1 --dataset_path <../data/abssum/> --emb2emb bovtobov --validate --validation_frequency -1 --loss localbagloss --al_bag_loss hausdorff length --al_bag_loss_weights 1.0 1.0 --outputdir <output-dir-path> --outputmodelname <path-to-save-classifier> --heads 1 --output_file <path-to-save-results-csv> --project_input_dimension 128 --unaligned --binary_classifier_path no_eval --al_weighting window --al_weighting_center input --al_softmax_temp 1.0 --al_windowsize 0 --al_input_center_factor 0.3 --al_differentiable --n_layers_binary 1 --hidden_size_binary 128 --n_epochs_binary 3 --binary_dense_layer_size 64 --max_length 250 --point_gen --max_input_length 250 --point_gen_offset --train_classifier_only --dropout_binary 0.5 --lr_bclf 0.0001
```
Note the `--train_classifier_only` option, which prevents training on the downstream task from starting right away.

Now you are ready to train the actual task:

```bash
cd ../emb2emb/
python train.py --embedding_dim 128 --batch_size 64 --lr 0.00005 --modeldir <path-to-modeldir> --data_fraction 1.0 --test_data_fraction 1.0 --n_epochs 10 --n_layers 1 --dataset_path <../data/abssum/> --emb2emb bovtobov --validate --validation_frequency -1 --loss localbagloss --al_bag_loss hausdorff length --al_bag_loss_weights 1.0 <lambda_len> --outputdir <output-dir-path> --outputmodelname <path-to-save-classifier> --heads 1 --output_file <path-to-save-results-csv> --project_input_dimension 128 --unaligned --binary_classifier_path no_eval --al_weighting window --al_weighting_center input --al_softmax_temp 1.0 --al_windowsize 0 --al_input_center_factor 0.3 --al_differentiable --n_layers_binary 1 --hidden_size_binary 128 --n_epochs_binary 3 --binary_dense_layer_size 64 --max_length 250 --point_gen --max_input_length 250 --point_gen_offset
```
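
The two `train.py` invocations above share almost all of their flags. They differ only in `--n_epochs`, the second value of `--al_bag_loss_weights`, and the three trailing classifier options. The following sketch builds both argument lists from one place. The helper `build_command` and the `paths` dictionary are hypothetical. The sketch also reorders some flags, which assumes that `train.py` accepts its options in any order.

```python
import shlex
from typing import Dict, List

# Flags that are identical in both commands above.
FIXED = shlex.split(
    "--embedding_dim 128 --batch_size 64 --lr 0.00005 --data_fraction 1.0 "
    "--test_data_fraction 1.0 --n_layers 1 --emb2emb bovtobov --validate "
    "--validation_frequency -1 --loss localbagloss "
    "--al_bag_loss hausdorff length --heads 1 --project_input_dimension 128 "
    "--unaligned --binary_classifier_path no_eval --al_weighting window "
    "--al_weighting_center input --al_softmax_temp 1.0 --al_windowsize 0 "
    "--al_input_center_factor 0.3 --al_differentiable --n_layers_binary 1 "
    "--hidden_size_binary 128 --n_epochs_binary 3 "
    "--binary_dense_layer_size 64 --max_length 250 --point_gen "
    "--max_input_length 250 --point_gen_offset"
)


def build_command(paths: Dict[str, str], n_epochs: int, lambda_len: float,
                  classifier_only: bool = False) -> List[str]:
    """Return the argument list for one train.py run.

    `paths` maps the path-valued flag names (modeldir, dataset_path,
    outputdir, outputmodelname, output_file) to their values.
    """
    cmd = ["python", "train.py", *FIXED,
           "--n_epochs", str(n_epochs),
           "--al_bag_loss_weights", "1.0", str(lambda_len)]
    for flag, value in paths.items():
        cmd += [f"--{flag}", value]
    if classifier_only:
        # Extra options used only in the classifier-training command.
        cmd += ["--train_classifier_only",
                "--dropout_binary", "0.5", "--lr_bclf", "0.0001"]
    return cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from inside the `emb2emb` directory, first with `n_epochs=0, lambda_len=1.0, classifier_only=True` and then with `n_epochs=10` and your chosen `lambda_len`.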

# Experiments and results
We added *.csv files for each of the experiments in our study to the iclr_experiments folder. The CSV files contain not only the results but also all training parameters used to obtain them, so they can be used to reproduce the results.
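
To inspect one of these files, a sketch like the following may help. It assumes that each CSV file has a header row and one row per run, which you should verify against the actual files. The helper names `load_runs` and `varying_columns` are hypothetical.

```python
import csv
from typing import Dict, List


def load_runs(csv_path: str) -> List[Dict[str, str]]:
    """Read a results file into a list of {column name: value} rows."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def varying_columns(rows: List[Dict[str, str]]) -> List[str]:
    """Return the columns whose value is not the same in every row.

    These are typically the parameters that were varied and the results.
    """
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) > 1]
```

Columns that are constant across all rows then correspond to settings shared by every run in that file.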
