# FewGen

The source code for ICLR 2023 submission #3797.

## Requirements

Before running the code, install the repository dependencies:

```bash
conda env create -f environment.yml
```

This creates a conda environment named `fewgen`. Activate it before running any code:

```bash
conda activate fewgen
```
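
Because the scripts in this README rely on the packages installed into `fewgen`, a small guard can catch runs started from the wrong environment. The following is a minimal sketch, not part of the repository: the function name `require_fewgen_env` is hypothetical, and it assumes the standard `CONDA_DEFAULT_ENV` variable that `conda activate` exports.

```bash
# Hypothetical helper (not part of the repository).
# Returns 0 if the active conda environment is `fewgen`, 1 otherwise.
# Assumes CONDA_DEFAULT_ENV, which `conda activate` sets to the environment name.
require_fewgen_env() {
  if [ "${CONDA_DEFAULT_ENV:-}" != "fewgen" ]; then
    echo "Expected conda env 'fewgen', found '${CONDA_DEFAULT_ENV:-none}'" >&2
    return 1
  fi
}
```

It could be called at the top of a wrapper script, for example `require_fewgen_env || exit 1`.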

## Overview

**FewGen** is a **Few**-Shot **Gen**eration method for few-shot learning on NLU tasks. **FewGen** trains an autoregressive PLM on a few training samples to generate training data, then trains a classifier for the task via both few-shot samples and generated samples.

<img src="./FewGen.png" width="1000px"></img>

**Training and Test Data**: We follow [LM-BFF](https://github.com/princeton-nlp/LM-BFF) for the train/dev/test set split of the GLUE tasks.

**Pretraining Corpus**: We provide the processed pretraining corpus (Wikipedia and OpenWebText) for generating training data for 
sequence-pair tasks under the [`pretrain_corpus`](pretrain_corpus) directory; see the [README file](pretrain_corpus/README.md) there for 
details.

## Few-Shot Data Setup

Download the original train/dev/test sets for GLUE tasks:

```bash
cd data
bash download_data.sh
cd ..
```

Generate few-shot samples for generator training (16 samples per class, per task):

```bash
python gen_k_shot.py
```


## Generating Training Data

**Generator Training**: Before generating training data for the classifier PLM, the generator must first be trained on
the few-shot samples available for each task. The provided script [`train_gen.sh`](train_gen.sh) performs this training; basic usage is as follows:

```bash
bash train_gen.sh $GPU_ID $TASK $MODEL $SEED
```

*Example*: Training a generator for the MNLI task on GPU 0, with `ctrl` as the model and seed 13:

```bash
bash train_gen.sh 0 MNLI ctrl 13
```
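
Since `train_gen.sh` takes the seed as an argument, training generators for several seeds amounts to a loop over the same command. The sketch below is one way to do this under stated assumptions: the function `train_gen_seeds` and the `DRY_RUN` switch are hypothetical and not part of the repository; only the `train_gen.sh` argument order is taken from the usage shown above.

```bash
# Hypothetical helper (not part of the repository): train one generator per seed.
# Arguments: GPU_ID TASK MODEL SEED [SEED ...]
# Set DRY_RUN=1 to print the commands instead of running them.
train_gen_seeds() {
  local gpu_id="$1" task="$2" model="$3"
  shift 3
  local seed
  for seed in "$@"; do
    # ${DRY_RUN:+echo} expands to `echo` only when DRY_RUN is set and non-empty.
    ${DRY_RUN:+echo} bash train_gen.sh "$gpu_id" "$task" "$model" "$seed" || return 1
  done
}
```

For instance, `train_gen_seeds 0 MNLI ctrl 13 21` would train two generators one after the other; the seed 21 is purely illustrative.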

**Data Generation**: The provided script [`gen_train_data.sh`](gen_train_data.sh) generates training data using the generator trained above; basic usage is as follows:

```bash
bash gen_train_data.sh $GPU_ID $TASK $MODEL_PATH $PRE_TRAIN_CORPUS_PATH $NUM_GEN_PER_CLASS $SEED
```

*Example (continued)*: Generating 5000 samples per class with the MNLI generator trained above, using `pretrain_corpus/wiki_short.txt` as the pretraining corpus:

```bash
bash gen_train_data.sh 0 mnli train_gen_all_label_disc_final_13/MNLI/ctrl-prefix-infix-5e-3-2-20-meta-weight pretrain_corpus/wiki_short.txt 5000 13
```

**Generated File Combination**: The script above automatically generates data for every label of the given task. These files must be combined into
a single JSON file for classifier training. The provided script [`combine_gen.sh`](combine_gen.sh) does this;
basic usage is as follows:

```bash
bash combine_gen.sh $SEED $TASK
```

*Example (continued)*: Combining our MNLI training files:

```bash
bash combine_gen.sh 13 MNLI
```

## Classifier Fine-Tuning

The entry script for fine-tuning a classifier on few-shot and generated data is [`finetune.sh`](finetune.sh). The basic usage is as follows:

```bash
bash finetune.sh $GPU_ID $TASK $SEED
```

*Example (continued)*: Fine-tuning a classifier on the MNLI task, on GPU 0, with seed 13:

```bash
bash finetune.sh 0 MNLI 13
```
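
The four steps in this README (generator training, data generation, file combination, classifier fine-tuning) can be chained for a single task and seed. The following is a rough sketch rather than a script shipped with the repository: `run_fewgen_pipeline` and `DRY_RUN` are hypothetical names, and the generator output directory is passed in explicitly because its naming scheme is not assumed here. Note that the examples in this README pass the task as `mnli` to `gen_train_data.sh` and as `MNLI` to the other scripts; this sketch passes one value throughout, so it may need adjusting if the scripts are case-sensitive.

```bash
# Hypothetical wrapper (not part of the repository) chaining the documented steps.
# Arguments: GPU_ID TASK MODEL SEED MODEL_PATH PRE_TRAIN_CORPUS_PATH NUM_GEN_PER_CLASS
# MODEL_PATH is the directory written by train_gen.sh (supplied by the caller).
# Set DRY_RUN=1 to print the commands instead of running them.
run_fewgen_pipeline() {
  local gpu_id="$1" task="$2" model="$3" seed="$4"
  local model_path="$5" corpus="$6" num_gen="$7"
  local run="${DRY_RUN:+echo}"  # `echo` in dry-run mode, empty otherwise

  # Each step runs only if the previous one succeeded.
  $run bash train_gen.sh "$gpu_id" "$task" "$model" "$seed" &&
  $run bash gen_train_data.sh "$gpu_id" "$task" "$model_path" "$corpus" "$num_gen" "$seed" &&
  $run bash combine_gen.sh "$seed" "$task" &&
  $run bash finetune.sh "$gpu_id" "$task" "$seed"
}
```

Running it with `DRY_RUN=1` first prints the four commands in order, which is a cheap way to check the arguments before committing GPU time.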

