<h1 align="center"> <p>HiddenKey</p></h1>



This repo supports the paper "HiddenKey: Parameter-Efficient FineTuning Meets Dropout under a Unified Framework", which is under review. The project files combine the [huggingface/peft (v0.3.0)](https://github.com/huggingface/peft/tree/v0.3.0), [huggingface/transformers (v4.29.2)](https://github.com/huggingface/transformers/tree/v4.29.2) and [microsoft/LoRA](https://github.com/microsoft/LoRA) repositories, all of which are publicly available on GitHub. All rights are reserved by the original authors.

## Overview

The powerful emerging capabilities of large language models (LLMs) have established them as a fundamental element of applications that rely on advanced language understanding. At the same time, finetuning has become the standard approach to adapting LLMs to a concrete application (e.g., instruction tuning, alignment tuning, and task/user-specific specialization). Because full finetuning is costly, parameter-efficient finetuning (PEFT) methods, especially LoRA, have gained popularity for their lower storage, memory, and computation requirements. However, the possible contradiction between a limited number of trainable parameters and dropout regularization (which aims to alleviate the overfitting associated with excessive parameter redundancy) has been largely overlooked. Through extensive experiments on LoRA-based PEFT, we first confirm that PEFT is also prone to overfitting. We then revisit transformer-specific dropout methods and validate their equivalence and differences mathematically and empirically. To facilitate a comprehensive comparison, we introduce a unified framework that instantiates them along three dimensions (dropping position, structural pattern and compensation measure) and uncover their new preferences and relative performance in PEFT scenarios. This framework also enables us to integrate the best of each into a new dropout method named HiddenKey, which outperforms existing methods on both NLU and NLG tasks. Compared to baselines, it also achieves better performance with less finetuning time, and continues to improve with further finetuning. These results highlight HiddenKey as the better practice for high-performance and parameter-efficient finetuning of LLMs.

## Environment Setting

```bash
# create conda environment
conda create -n peft python=3.9 --yes
conda activate peft && conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# NLU
cd peft  # project root
pip install -e ./transformers
pip install -e .
pip install evaluate accelerate bitsandbytes

# NLG
cd exps && bash create_datasets.sh  # prepare datasets
cd ./eval && bash download_evalscript.sh && cd ..
pip install apex pyter3 bert_score 
conda install future matplotlib nltk

cd ./NLG/eval/GenerationEval # prepare evaluation for NLG tasks
git clone https://github.com/google-research/bleurt.git
cd bleurt && pip install . && pip install razdel tabulate
```

## Getting Started

### NLU

#### RoBERTa Large

We run our analysis with RoBERTa-large on most of the GLUE tasks, namely `MRPC, RTE, CoLA, SST-2, STS-B, QNLI and MNLI`, which cover diverse dataset sizes and task types for more robust conclusions. For each task, we provide an independent script under `./exps` named `loop_<name of dataset>_roberta_large.sh`.

Here’s an example for RTE, from the script `loop_rte_roberta_large.sh`:

```bash
#!/bin/bash

###################  Modified params  ##########################
modified_dropout_pattern=$1
modified_dropout_rate=$2
modified_aug_loss=$3
modified_aug_loss_weight=$4
GPU_ID=$5
project_root=<project_root>
python_path=<python_path>

###################  LoRA params  ##########################
task_type=SEQ_CLS    # Task type
inference_mode=False # Whether to use inference mode
r=8                  # Lora attention dimension
lora_alpha=16        # Lora alpha
lora_dropout=0.0     # Lora dropout

###################  Data  ##########################
task_name=rte
max_seq_length=512

###################  Model  ##########################
model_name_or_path=roberta-large

###################  Training params  ##########################
num_train_epochs=30
per_device_train_batch_size=16
per_device_eval_batch_size=16
gradient_accumulation_steps=4
learning_rate=4e-4
warmup_ratio=0.06
weight_decay=0.1
metric_for_best_model=accuracy
greater_is_better=True
disable_tqdm=True
run_name=glue.${task_name}

#######################  Run  ############################
export PYTHONPATH=${project_root}:$PYTHONPATH
...
```

Before running a script, you need to set `project_root` and `python_path`. The former is the root path of the current project, while the latter is the absolute path of the Python interpreter in the conda environment.
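
One possible way to find these two values is sketched below. It assumes that you run it from the project root with the `peft` conda environment activated, so that `python` on the `PATH` is the environment's interpreter. The printed values can then be pasted into the script in place of `<project_root>` and `<python_path>`.

```bash
#!/bin/bash
# Sketch only: assumes the current directory is the project root and that
# the `peft` conda environment is active.
project_root=$(pwd)

# Absolute path of the first `python` on PATH (empty if none is found).
python_path=$(command -v python || true)

echo "project_root=${project_root}"
echo "python_path=${python_path}"
```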

The most significant parameters for our analysis are `modified_dropout_pattern`, `modified_dropout_rate`, `modified_aug_loss` and `modified_aug_loss_weight`. 

- `modified_dropout_pattern`: **dropping positions and structural patterns**, which can be
  - **Blank**: disable all dropout.
  - **One** of `hiddencut_element`, `hiddencut_column`, `hiddencut_span`, `dropkey_element`, `dropkey_column`, `dropkey_span`, `dropattn_element`, `dropattn_column`, `dropattn_span`, `drop_input` and `drop_classifier`.
  - **Tuple of the above options separated by commas**: use several of the dropout methods above simultaneously.

- `modified_dropout_rate`: **dropout rates separated by commas**, one for each entry in `modified_dropout_pattern`.
  - The two lists must have the same number of elements, because each dropout pattern has its own dropout rate. Leave it blank if `modified_dropout_pattern` is blank.
  - Every value should be in [0, 1), where 0 disables the corresponding dropout method.

- `modified_aug_loss`: **augmented loss**, which can be `none`, `kl` or `js`, representing no augmented loss, bidirectional Kullback-Leibler (KL) divergence loss and Jensen-Shannon (JS) consistency loss, respectively.
- `modified_aug_loss_weight`: **weight of augmented loss**.

You can also select the GPU to run on with `GPU_ID`.
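
The constraints on the pattern and rate lists can be checked before launching a long run. The following is a minimal sketch, not part of the repository: `validate_dropout_args` is a hypothetical helper that only checks that both lists have the same length and that every rate lies in [0, 1). It does not check the pattern names themselves.

```bash
#!/bin/bash
# Hypothetical helper (not part of the repository): validates the
# comma-separated pattern and rate lists described above.
validate_dropout_args() {
  local patterns=$1 rates=$2
  local -a pattern_list rate_list
  IFS=',' read -r -a pattern_list <<< "$patterns"
  IFS=',' read -r -a rate_list <<< "$rates"

  # Each dropout pattern needs exactly one dropout rate.
  if [ "${#pattern_list[@]}" -ne "${#rate_list[@]}" ]; then
    echo "error: ${#pattern_list[@]} pattern(s) but ${#rate_list[@]} rate(s)" >&2
    return 1
  fi

  local rate
  for rate in "${rate_list[@]}"; do
    # awk does the floating-point comparison; bash arithmetic is integer-only.
    if ! awk -v r="$rate" 'BEGIN { exit !(r ~ /^[0-9]*\.?[0-9]+$/ && r + 0 >= 0 && r + 0 < 1) }'; then
      echo "error: rate '$rate' is not in [0, 1)" >&2
      return 1
    fi
  done
}

validate_dropout_args "hiddencut_element,dropkey_column" "0.05,0.1" && echo "arguments look consistent"
```

The function returns a non-zero status on the first problem it finds, so it can guard the real command with `&&`.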

Example:


```bash
modified_dropout_pattern=hiddencut_element,dropkey_column
modified_dropout_rate=0.05,0.1
modified_aug_loss=kl
modified_aug_loss_weight=0.5
```

This configuration will employ element-wise HiddenCut and column-wise DropKey with dropout rates of 0.05 and 0.1, respectively. It will also augment the loss with KL loss weighted by 0.5. 

For each script, the default configuration repeats the LoRA-based PEFT run five times, with random seeds from 0 to 4. You can change this by setting `seed_min` and `seed_max`; the experiment is then repeated for each seed in `[seed_min, seed_max]`.
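
The seed range can be pictured as a simple inclusive loop. The sketch below is illustrative only and assumes that `seed_min` and `seed_max` are plain integer variables; `run_all_seeds` and the echoed line are hypothetical stand-ins for the actual finetuning command, and the repository's scripts may structure the loop differently.

```bash
#!/bin/bash
# Illustrative sketch only; not taken from the repository's scripts.
seed_min=0
seed_max=4

# Hypothetical helper: visits every seed in [min, max], both ends included.
run_all_seeds() {
  local min=$1 max=$2 seed
  for ((seed = min; seed <= max; seed++)); do
    # Stand-in for the real finetuning command.
    echo "seed=${seed}"
  done
}

run_all_seeds "$seed_min" "$seed_max"
```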

Run the above configuration on the RTE dataset on GPU 0 with:

```bash
cd ./exps
bash loop_rte_roberta_large.sh hiddencut_element,dropkey_column 0.05,0.1 kl 0.5 0
```

You can also run all possible configurations on other datasets similarly.
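
To sweep one configuration over several datasets, a small wrapper can build the script names from the naming pattern `loop_<name of dataset>_roberta_large.sh`. The sketch below is a dry run that only prints the commands. The lowercase task names in `tasks` are assumptions about the actual file names, and `print_commands` is a hypothetical helper, so check both against the files present in `./exps` before running anything.

```bash
#!/bin/bash
# Dry run: prints one command per task instead of executing it.
# ASSUMPTION: these lowercase names match the script file names in ./exps.
tasks=(mrpc rte cola sst2 stsb qnli mnli)

# Hypothetical helper: forwards its arguments to every per-task script name.
print_commands() {
  local task
  for task in "${tasks[@]}"; do
    echo "bash loop_${task}_roberta_large.sh $*"
  done
}

print_commands hiddencut_element,dropkey_column 0.05,0.1 kl 0.5 0
```

Once the printed commands look right, they can be run one by one from `./exps`.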

Note: Evaluation has already been included in the finetuning process by setting `do_eval=True` in the config file.

#### LLaMA-7B

Running LoRA-based PEFT with the LLaMA-7B model is almost identical to that of RoBERTa-large. The relevant configurations are also under the folder `./exps`, named `loop_<name_of_the_task>_llama.sh`.

Run PEFT with the LLaMA-7B model by:
```bash
cd ./exps
bash loop_rte_llama.sh hiddencut_element,dropkey_column 0.05,0.1 kl 0.5 0
```

This will finetune five LLaMA-7B models on the RTE dataset with different random seeds on GPU 0. Element-wise HiddenCut and column-wise DropKey are applied with dropout rates of 0.05 and 0.1, respectively, and the KL loss is weighted by 0.5.

### NLG

For simplicity, NLG experiments are configured in the same way as NLU experiments. The scripts are in the folder `./exps/NLG`, where `run_gpt2_e2e.sh` and `run_gpt2_webnlg.sh` are provided for the E2E and WebNLG datasets, respectively. For each script, PEFT is repeated three times with random seeds in [0, 2], and the models are evaluated automatically after the finetuning process.

For example, run LoRA-based PEFT on the E2E dataset:

```bash
cd ./exps/NLG
bash run_gpt2_e2e.sh hiddencut_element,dropkey_column 0.05,0.1 kl 0.5 0
```

For the WebNLG dataset, use the following command:

```bash
cd ./exps/NLG
bash run_gpt2_webnlg.sh hiddencut_element,dropkey_column 0.05,0.1 kl 0.5 0
```

## License

[Apache License 2.0](LICENSE)