# GPT-2 Experiments for RAdaGrad and RAdamW Optimizers


This repository builds on the original [Preconditioned LoRA (Zhang et al., 2024)](https://arxiv.org/abs/2402.02347) project.
<!-- <p>
<img src="figures/score.png" width="800" >
</p> -->

<!-- We also evaluate varying LoRA ranks and different model sizes. Here we demonstrate our experiment on the E2E dataset (Table 1 result). Follow the [LoRA](https://github.com/microsoft/LoRA/tree/main) repository for experiments on other datasets. -->

## Repository Overview

* [examples/NLG/src/](examples/NLG/src) contains the source code used for data processing, training, and decoding.
* [examples/NLG/eval/](examples/NLG/eval) contains the code for task-specific evaluation scripts.
* [examples/NLG/data/](examples/NLG/data) contains the raw data we used in our experiments.
* [examples/NLG/vocab/](examples/NLG/vocab) contains the GPT-2 vocabulary files.
* [loralib/](loralib) contains the LoRA library implementation.

## Requirements

```
# Python 3.8.20
# CUDA 11.0
# pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
transformers==3.3.1
spacy
tqdm
tensorboard
progress
regex
```

Our experiments were conducted on older versions of Python and PyTorch, so some usages differ from newer versions. Specifically:
+ In `gpu.py`: the handling of `local_rank`.
+ In `optimizer_custom.py`: for a matrix $X$ with SVD $X = USV^T$, `torch.svd(X)` returns $U, S, V$, whereas `torch.linalg.svd(X)` in newer PyTorch versions returns $U, S, V^T$.

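
The two differences listed above can usually be bridged with small compatibility helpers. The following is a minimal sketch, not the repository's code: `resolve_local_rank` and `svd_usv` are hypothetical names, and how `gpu.py` and `optimizer_custom.py` actually consume these values is an assumption. It relies on two PyTorch behaviours: `torch.distributed.launch` passes the rank as a `--local_rank` argument (spelled `--local-rank` from PyTorch 2.0), while `torchrun` exports it as the `LOCAL_RANK` environment variable; and `torch.linalg.svd` returns $V^H$ with `full_matrices=True` by default, whereas `torch.svd` returns $V$ from the reduced factorization.

```python
import argparse
import os


def resolve_local_rank(argv=None):
    """Return the local rank from the command line, else $LOCAL_RANK, else 0."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # Accept both spellings used by different PyTorch launcher versions.
    parser.add_argument("--local_rank", "--local-rank",
                        dest="local_rank", type=int, default=None)
    args, _ = parser.parse_known_args(argv)  # ignore all other training flags
    if args.local_rank is not None:
        return args.local_rank
    return int(os.environ.get("LOCAL_RANK", 0))


def svd_usv(x):
    """Return (U, S, V) with x = U @ diag(S) @ V^T on old and new PyTorch."""
    import torch  # imported lazily so resolve_local_rank works without PyTorch

    if hasattr(torch, "linalg") and hasattr(torch.linalg, "svd"):
        # Newer API: returns V^H; request the reduced factorization to
        # match the default of torch.svd.
        u, s, vh = torch.linalg.svd(x, full_matrices=False)
        return u, s, vh.transpose(-2, -1).conj()
    # Older API (e.g. PyTorch 1.7.1): already returns V.
    return torch.svd(x)
```

With a helper like `svd_usv`, code written against the `torch.svd` convention can keep using $V$ directly on either PyTorch version.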
## Quickstart

Clone the repository and run the following commands:
```
cd examples/NLG
pip install -r requirement.txt
bash download_pretrained_checkpoints.sh
bash create_datasets.sh
cd ./eval
bash download_evalscript.sh
cd ../../..
python setup.py develop
conda install openjdk=8
```
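
Before launching training, it can help to confirm that the files referenced by the later commands exist. The sketch below checks only the paths that appear in the commands of this README, relative to `examples/NLG`. The helper name `missing_paths` is hypothetical, and the assumption that each path is produced by the setup scripts above has not been verified against the scripts themselves.

```python
from pathlib import Path

# Paths taken from the commands in this README, relative to examples/NLG.
EXPECTED = [
    "pretrained_checkpoints/gpt2-pytorch_model.bin",
    "data/e2e/train.jsonl",
    "data/e2e/valid.jsonl",
    "data/e2e/test.jsonl",
    "data/e2e/test_formatted.jsonl",
    "vocab",
    "eval/e2e/measure_scores.py",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected files are present.")
```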


## E2E Experiment
1. Enter the experiment folder
```
cd examples/NLG
```

2. Train GPT-2 small with the RAdamW optimizer (see our paper for hyperparameters)
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_ft.py \
    --train_data ./data/e2e/train.jsonl \
    --valid_data ./data/e2e/valid.jsonl \
    --train_batch_size 8 \
    --grad_acc 1 \
    --valid_batch_size 4 \
    --seq_len 512 \
    --model_card gpt2.sm \
    --init_checkpoint ./pretrained_checkpoints/gpt2-pytorch_model.bin \
    --platform local \
    --clip 0.0 \
    --lr 8e-3 \
    --weight_decay 0.01 \
    --correct_bias   \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-3 \
    --scheduler linear  \
    --warmup_step 500 \
    --max_epoch 5   \
    --save_interval 4000  \
    --lora_dim 4   \
    --lora_alpha 32  \
    --lora_dropout 0.1  \
    --label_smooth 0.1   \
    --work_dir ./trained_models/GPT2_S/e2e  \
    --random_seed 110  \
    --trial_name rie_adamw_experiment_r4 \
    --opt rie_adamw
```
Besides <code>rie_adamw</code>, <code>rie_adagrad</code> and <code>riegrad</code> are also valid choices for <code>--opt</code>.
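
The generation step needs the path of a checkpoint written to `--work_dir` during training. The checkpoint name used in this README ends in a number followed by `.pt`. Assuming that number is the training step at which the file was saved (an assumption, not confirmed here), a small helper can pick the most recent one. `latest_checkpoint` is a hypothetical name, not part of this repository.

```python
import re
from pathlib import Path


def latest_checkpoint(work_dir):
    """Return the *.pt file in work_dir with the largest numeric suffix, or None."""
    best, best_step = None, -1
    for path in Path(work_dir).glob("*.pt"):
        # Match names of the form <anything>.<step>.pt
        match = re.search(r"\.(\d+)\.pt$", path.name)
        if match and int(match.group(1)) > best_step:
            best, best_step = path, int(match.group(1))
    return best
```

For example, `latest_checkpoint("./trained_models/GPT2_S/e2e")` returns a path that can be passed to `--init_checkpoint`, or `None` if no matching file exists.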

3. Generate output
```
python -m torch.distributed.launch --nproc_per_node=1 src/gpt2_beam.py \
    --data ./data/e2e/test.jsonl \
    --batch_size 1 \
    --seq_len 512 \
    --eval_len 64 \
    --model_card gpt2.sm \
    --init_checkpoint ./trained_models/GPT2_S/e2e/model_e2e_rie_adamw__experiment_r4.26290.pt \
    --platform local \
    --lora_dim 4 \
    --lora_alpha 32 \
    --beam 10 \
    --length_penalty 0.8 \
    --no_repeat_ngram_size 4 \
    --repetition_penalty 1.0 \
    --eos_token_id 628 \
    --work_dir ./trained_models/GPT2_S/e2e \
    --output_file predict_e2e_rie_adamw_experiment_r4.jsonl
```

4. Decode the outputs from step 3
```
python src/gpt2_decode.py \
    --vocab ./vocab \
    --sample_file ./trained_models/GPT2_S/e2e/predict_e2e_rie_adamw_experiment_r4.jsonl \
    --input_file ./data/e2e/test_formatted.jsonl \
    --output_ref_file e2e_ref.txt \
    --output_pred_file e2e_pred.txt
```

5. Run evaluation on the E2E test set
```
python eval/e2e/measure_scores.py e2e_ref.txt e2e_pred.txt -p
```

## Parameter Reference 
### RGD Results
```
python -m torch.distributed.launch --nproc_per_node=1  src/gpt2_ft.py  \
   --train_data ./data/e2e/train.jsonl   \
   --valid_data ./data/e2e/valid.jsonl  \
   --train_batch_size 8  \
   --grad_acc 1   \
   --valid_batch_size 4  \
   --seq_len 512  \
   --model_card gpt2.sm \
   --init_checkpoint ./pretrained_checkpoints/gpt2-pytorch_model.bin  \
   --platform local  \
   --clip 0.0  \
   --lr 8e-2 \
   --weight_decay 0.0001 \
   --correct_bias   \
   --scheduler linear  \
   --warmup_step 500 \
   --max_epoch 5   \
   --save_interval 4000  \
   --lora_dim 4   \
   --lora_alpha 32  \
   --lora_dropout 0.1  \
   --label_smooth 0.1   \
   --work_dir ./trained_models/GPT2_S/e2e  \
   --random_seed 110  \
   --trial_name riemannian_gd_experiment_r4 \
   --opt riegrad
```
### RAdaGrad Results
```
python -m torch.distributed.launch --nproc_per_node=1  src/gpt2_ft.py  \
   --train_data ./data/e2e/train.jsonl   \
   --valid_data ./data/e2e/valid.jsonl  \
   --train_batch_size 8  \
   --grad_acc 1   \
   --valid_batch_size 4  \
   --seq_len 512  \
   --model_card gpt2.sm \
   --init_checkpoint ./pretrained_checkpoints/gpt2-pytorch_model.bin  \
   --platform local  \
   --clip 0.0  \
   --lr 5e-3 \
   --weight_decay 0.01 \
   --adam_epsilon 1e-3 \
   --correct_bias   \
   --adam_beta2 0.98 \
   --scheduler linear  \
   --warmup_step 500 \
   --max_epoch 5   \
   --save_interval 4000  \
   --lora_dim 4   \
   --lora_alpha 32  \
   --lora_dropout 0.1  \
   --label_smooth 0.1   \
   --work_dir ./trained_models/GPT2_S/e2e  \
   --random_seed 110  \
   --trial_name rie_adagrad_experiment_r4 \
   --opt rie_adagrad
```
### RAdamW Results
```
python -m torch.distributed.launch --nproc_per_node=1  src/gpt2_ft.py  \
   --train_data ./data/e2e/train.jsonl   \
   --valid_data ./data/e2e/valid.jsonl  \
   --train_batch_size 8  \
   --grad_acc 1   \
   --valid_batch_size 4  \
   --seq_len 512  \
   --model_card gpt2.sm \
   --init_checkpoint ./pretrained_checkpoints/gpt2-pytorch_model.bin  \
   --platform local  \
   --clip 0.0  \
   --lr 8e-3 \
   --weight_decay 0.01 \
   --correct_bias   \
   --adam_beta1 0.9 \
   --adam_beta2 0.98 \
   --adam_epsilon 1e-3 \
   --scheduler linear  \
   --warmup_step 500 \
   --max_epoch 5   \
   --save_interval 4000  \
   --lora_dim 4   \
   --lora_alpha 32  \
   --lora_dropout 0.1  \
   --label_smooth 0.1   \
   --work_dir ./trained_models/GPT2_S/e2e  \
   --random_seed 110  \
   --trial_name rie_adamw_experiment_r4 \
   --opt rie_adamw
```
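
The three commands above share most of their flags and differ only in the optimizer settings. As a sketch of that structure, the snippet below rebuilds each argument list from a shared base plus per-optimizer overrides. All values are copied from the commands above; the names `LAUNCH`, `BASE_ARGS`, `OVERRIDES`, and `build_command` are hypothetical and not part of the repository.

```python
import shlex

LAUNCH = ["python", "-m", "torch.distributed.launch",
          "--nproc_per_node=1", "src/gpt2_ft.py"]

# Flags shared by all three runs. True marks a bare flag without a value.
BASE_ARGS = {
    "train_data": "./data/e2e/train.jsonl",
    "valid_data": "./data/e2e/valid.jsonl",
    "train_batch_size": "8",
    "grad_acc": "1",
    "valid_batch_size": "4",
    "seq_len": "512",
    "model_card": "gpt2.sm",
    "init_checkpoint": "./pretrained_checkpoints/gpt2-pytorch_model.bin",
    "platform": "local",
    "clip": "0.0",
    "correct_bias": True,
    "scheduler": "linear",
    "warmup_step": "500",
    "max_epoch": "5",
    "save_interval": "4000",
    "lora_dim": "4",
    "lora_alpha": "32",
    "lora_dropout": "0.1",
    "label_smooth": "0.1",
    "work_dir": "./trained_models/GPT2_S/e2e",
    "random_seed": "110",
}

# Settings that differ between the optimizers.
OVERRIDES = {
    "riegrad": {"lr": "8e-2", "weight_decay": "0.0001",
                "trial_name": "riemannian_gd_experiment_r4"},
    "rie_adagrad": {"lr": "5e-3", "weight_decay": "0.01",
                    "adam_epsilon": "1e-3", "adam_beta2": "0.98",
                    "trial_name": "rie_adagrad_experiment_r4"},
    "rie_adamw": {"lr": "8e-3", "weight_decay": "0.01",
                  "adam_beta1": "0.9", "adam_beta2": "0.98",
                  "adam_epsilon": "1e-3",
                  "trial_name": "rie_adamw_experiment_r4"},
}


def build_command(opt):
    """Return the training command for one optimizer as an argument list."""
    args = dict(BASE_ARGS, **OVERRIDES[opt], opt=opt)
    cmd = list(LAUNCH)
    for name, value in args.items():
        cmd.append("--" + name)
        if value is not True:  # bare flags such as --correct_bias take no value
            cmd.append(value)
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_command("rie_adamw")))
```

Keeping the values as strings preserves the exact spelling used in the commands (for example `8e-3`), and the resulting list can be passed to `subprocess.run` from within `examples/NLG`.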
