# Text Structure Prediction (TSP)
TSP is a self-supervised inter-sentence pre-training task that exploits text structure information to improve the language understanding of language models. For details, see our paper [Improving Language Model Pretraining with Text Structure Information]().

# Setup
```
conda env create -f conda_environment.yml
conda activate tsp
```
# Pre-training

```
python Pretraining/run.py [OPTION] SETTING OVERWRITES...
```

**`OPTION`** :
- `--resume_ckpt_path=[PATH]` : path to a pre-trained pytorch-lightning-compatible checkpoint. If not specified, the pre-training will start from scratch.

**`SETTING`** : the name of a configuration pre-defined in `Pretraining/configuration.py`, used as the base configuration.

**`OVERWRITES`** : overwrite values in the configuration defined by `SETTING`. These arguments are passed to `OmegaConf.from_cli` and should be written in YAML format (see [OmegaConf's documentation](https://omegaconf.readthedocs.io/en/2.2_branch/usage.html?highlight=from_cli#from-command-line-arguments) for details). An argument takes priority over the arguments that come before it. The available arguments are listed in `_abstract_task/configuration.py` and `Pretraining/configuration.py`. Among these arguments, the following are required:

- `scale` : apply the pre-defined values for hyperparameters related to the scale of experiments (e.g. batch size, training steps, etc.). The possible choices are `small`, `base`, and `large`.
- `seed` : the random seed, which should be larger than zero. It exists for comparison and reproducibility on the same machine; if you need neither, any number will do.
- `devices` : the accelerator (usually CUDA) devices to use. See [PyTorch Lightning's documentation](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#devices) for details.
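
The priority rule for `OVERWRITES` can be illustrated with a small standard-library sketch. This is not OmegaConf's parser and not this repository's code: `parse_overrides` is a hypothetical helper that only mimics the "later argument wins" behaviour described above, using Python literals in place of YAML values.

```python
import ast


def parse_overrides(args):
    """Merge `key=value` tokens left to right; a later token replaces an earlier one."""
    config = {}
    for token in args:
        key, _, raw = token.partition("=")
        try:
            # Interpret numbers and lists (e.g. `1`, `[0]`) as Python literals.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (e.g. `base`, `wandb`) stays a plain string.
            value = raw
        config[key] = value
    return config


# `batch_size` appears twice; the later value is the one that is kept.
overrides = parse_overrides(["scale=base", "seed=1", "batch_size=64", "batch_size=32"])
print(overrides)
```

In the actual scripts the merging is done by OmegaConf on top of the chosen `SETTING`, so treat this only as a mental model for argument ordering.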


## Examples

```
python Pretraining/run.py mlm_tsp scale=small seed=1 devices="[0]" logger=wandb
```

After pre-training finishes, the checkpoint is saved as `.../tsp/checkpoints/mlm_tsp_small_seed=1.ckpt`.
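
Judging only from this example, the checkpoint file name appears to combine the setting, the scale, and the seed. The sketch below encodes that inferred pattern; `checkpoint_name` is a hypothetical helper, and the real naming logic in the repository may differ (for example, for per-epoch DeepSpeed checkpoints).

```python
def checkpoint_name(setting: str, scale: str, seed: int) -> str:
    """Build the file name using the pattern inferred from the example:
    <setting>_<scale>_seed=<seed>.ckpt (an assumption, not the repository's code)."""
    return f"{setting}_{scale}_seed={seed}.ckpt"


print(checkpoint_name("mlm_tsp", "small", 1))
```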

```
python Pretraining/run.py mlm_tsp scale=base seed=1 devices=4 batch_size=64 strategy="deepspeed_stage_1" logger=wandb
```
This picks 4 CUDA devices on the current machine and pre-trains the model with a total batch size of 64 x 4 = 256. It also applies the DeepSpeed strategy to accelerate training and reduce memory cost.
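
The batch-size arithmetic above treats `batch_size` as a per-device value. A minimal sketch of that calculation, where `total_batch_size` is a hypothetical helper rather than part of this repository:

```python
def total_batch_size(per_device_batch_size: int, num_devices: int) -> int:
    """Total number of examples per step when each device processes its own batch."""
    return per_device_batch_size * num_devices


# batch_size=64 on devices=4, as in the command above.
print(total_batch_size(64, 4))  # 256
```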

```
python Pretraining/run.py --resume_ckpt_path "/../tsp/checkpoints/mlm_tsp_large_seed=2_epoch=8.deepspeed" mlm_tsp scale=large devices=4 batch_size=72 strategy="deepspeed_stage_1" logger=wandb
```
An interrupted run can be resumed by specifying `--resume_ckpt_path`. The checkpoint ending in `.deepspeed` here is a DeepSpeed-compatible checkpoint for resuming DeepSpeed-enabled training. Note that when both the large scale and DeepSpeed are used, a DeepSpeed-compatible checkpoint is saved at every epoch.

# Fine-tuning

```
python SuperGLUE/run.py [OPTION] TASKS OVERWRITES...
```

**`OPTION`** :
- `--test` : create prediction outputs on the test sets. If not specified, it will perform fine-tuning.

**`TASKS`** : a comma-delimited string of identifiers of the tasks to be fine-tuned.

**`OVERWRITES`** : similar to `OVERWRITES` in pre-training. The available arguments are listed in `_abstract_task/configuration.py` and `SuperGLUE/configuration.py`. Among these arguments, the following are required:

- `load_ckpt_path` : pre-trained checkpoint's path that is relative to `.../tsp/checkpoints`.  

- `scale` : should be the same as the `scale` of the pre-trained checkpoint.

- `devices` : the accelerator (usually CUDA) devices to use.

## Examples
```
python SuperGLUE/run.py "rte,cb,copa,multirc,wic,boolq,record" load_ckpt_path="mlm_tsp_base_seed=1.ckpt" scale=base logger=wandb devices="[0]"
```

This fine-tunes 10 runs for each specified task from the same pre-trained checkpoint. The fine-tuned checkpoints are saved under an automatically created directory, `.../tsp/checkpoints/mlm_tsp_base_seed=1.finetuning`.

```
python SuperGLUE/run.py --test "rte,cb,copa,multirc,wic,boolq,record" load_ckpt_path="mlm_tsp_base_seed=1.finetuning" scale=base devices="[0]"
```
Setting `load_ckpt_path` to a directory is a special usage for testing on multiple tasks. For each task, the script looks for the single checkpoint whose file name starts with the task identifier and uses it as the fine-tuned checkpoint for testing. It creates the directory `.../tsp/checkpoints/mlm_tsp_base_seed=1.finetuning/test_outputs` and puts the prediction files there. You can then collect the prediction files and submit them to the official evaluation server.
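
The per-task checkpoint lookup described above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: `find_task_checkpoint` and `resolve_checkpoints` are hypothetical names, and the sketch assumes that "the only one checkpoint" means exactly one regular file in the directory may match a task's prefix.

```python
from pathlib import Path


def find_task_checkpoint(finetuning_dir, task):
    """Return the single file in `finetuning_dir` whose name starts with `task`."""
    matches = sorted(
        p for p in Path(finetuning_dir).iterdir()
        # Skip sub-directories such as `test_outputs`.
        if p.is_file() and p.name.startswith(task)
    )
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one checkpoint for task {task!r}, found {len(matches)}"
        )
    return matches[0]


def resolve_checkpoints(finetuning_dir, tasks):
    """Map each identifier in a comma-delimited `tasks` string to its checkpoint."""
    return {task: find_task_checkpoint(finetuning_dir, task) for task in tasks.split(",")}
```

Under this reading, a directory that holds several fine-tuned runs for the same task would need to be trimmed to one checkpoint per task before running with `--test`.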