# Non-Vacuous Generalization Bounds for Large Language Models

To reproduce the results in the paper *Non-Vacuous Generalization Bounds for Large Language Models*, there are a couple of steps.

1) First, we need to pretrain a GPT-2 model from scratch with the SubLoRA parametrization (associated file: `train.py`). 
2) Then, to obtain non-vacuous bounds, we load the pretrained model, quantize the weights with arithmetic coding, and plug in to the bound code to get the actual empirically optimized bound (associated file: `bound_pipeline.py`). 

Our GPT-2 pretraining code is adapted from https://github.com/karpathy/nanoGPT/tree/master. The transfer learning experiment on GLUE is based on https://github.com/microsoft/LoRA/blob/main/examples/NLU/examples/text-classification/run_glue_no_trainer.py

## Environment Installation

```bash
conda env create -f environment.yml
```

## Pretrain GPT-2 with SubLoRA

To perform pretraining, we first need to download and preprocess the OpenWebText dataset (See more details in https://github.com/karpathy/nanoGPT/blob/master/README.md). You can get the dataset by running

```bash
cd nanoGPT/
python data/openwebtext/prepare.py
```

If the python environment and OpenWebText are taken care of, we can start training. It's recommended to pretrain GPT-2 with the SubLoRA parametrization using 4 GPUs in parallel. Also, this version of the code only supports multi-GPU training within a single node. 

In the case of a single GPU training, here is the bash command

```bash
cd nanoGPT/
python train.py config/train_gpt2_LoRA_better_hparams.py --dataset_dir=[PATH_TO]/nanoGPT/data --intrinsic_dim=25000 --learning_rate=5e-3  --attention_linear_lora_r=4 --linear_head_lora_r=4
```

If we have 4 GPUs within a single node, here is the command 

```bash
cd nanoGPT/
torchrun --standalone --nproc_per_node=4 train.py config/train_gpt2_LoRA_better_hparams.py --dataset_dir=[PATH_TO]/nanoGPT/data --intrinsic_dim=25000 --learning_rate=5e-3  --attention_linear_lora_r=4 --linear_head_lora_r=4
```

The above example uses an intrinsic dimension for subspace training of 25000 and a learning rate of 5e-3. More details for these hyperparameter setup can be found in the paper. 

## Quantization and Computing Generalization Bounds

Once we have the pretrained model checkpoint, we can perform quantization and the bounds computation with the following command

```bash
cd nanoGPT/
python bound_pipeline.py config/train_gpt2_LoRA_bounds.py --dataset_dir=[PATH_TO]/nanoGPT/data --intrinsic_dim=25000 --learning_rate=5e-3 --levels=17 --best_checkpoint_path=[PATH_TO_BEST_CHECKPOINT]
```


## Transfer Learning

Additionally, to obtain the transfer learning generalization bound on GLUE, we can run the following finetuning experiments on GLUE

```bash
#!/bin/bash

intrinsic_dim=$1
load_pretrained_model=$2
task_name=$3

cd LoRA_ft/

python run_glue_no_trainer.py   --model_name_or_path gpt2   --task_name \"$task_name\"   --max_length 128       --pad_to_max_length   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 5   --output_dir [PATH_TO]/PAC_Bayes_LoRA/LoRA_ft/results/\"$task_name\"_\"$intrinsic_dim\"_\"$load_pretrained_model\"     --cache_dir [A_CACHE_FOLDER] --intrinsic_dim=\"$intrinsic_dim\"   --load_pretrained_model=\"$load_pretrained_model\"

```

This script performs SubLoRA finetuning on one of the GLUE datasets. Notice that `intrinsic_dim` is the dimensionality of the subspace parametrization, `load_pretrained_model=0` means starting from a randomly initialized GPT-2 model while `load_pretrained_model=1` refers to starting from a pretrained standard GPT-2 checkpoint from huggingface. `task_name` should be one of the GLUE datasets. For the transfer learning experiment, we're only considering `task_name` to be either `qqp` or `cola`. 

After finetuning, we can compute the finetuning accuracy generalization bound with the following command

```bash
python bound_pipeline_finetune.py config/finetune_gpt2_LoRA_bounds.py --intrinsic_dim=\"$intrinsic_dim\" --learning_rate=5e-3 --levels=17 --best_checkpoint_path=[PATH_TO]/PAC_Bayes_LoRA/LoRA_ft/results/\"$task_name\"_\"$intrinsic_dim\"_\"$load_pretrained_model\" --task_name=\"$task_name\" --rank_lora=8 --wandb_log=False  --load_pretrained_model=\"$load_pretrained_model\"
```

For the rebuttal experiment, we only ran `qqp` and `cola`. `cola` by default only have Matthew's correlation coefficient as its metrics. For classification accuracy on `cola`, we just read off the running accuracy during the bound computation. For `qqp`, accuracy is logged to wandb during the execution of the `run_glue_no_trainer.py` script. 
