# SwitchLoRA

<img width="600" alt="SwitchLoRA" src="./switchlora.png">

## Environment setup

1. Install PyTorch by following https://pytorch.org/get-started/locally/

2. Clone the repository and install dependencies

   ```shell
   cd path_to_this_repository
   pip install -r requirements.txt
   pip install flash-attn
   ```

Our code has been tested on `Ubuntu 22.04 LTS`. For details on the installed packages, see `requirements.txt.ubuntu`.

## Download data

```shell
cd llama
# Download data and preprocess it to a sequence length of 512
python download_data.py --save_dir preprocessed_data_512 --tokenizer t5-base --dataset allenai/c4 --dataset_config en --take 46000000 --text_field text --sequence_length 512
# Download data and preprocess it to a sequence length of 256
python download_data.py --save_dir preprocessed_data_256 --tokenizer t5-base --dataset allenai/c4 --dataset_config en --take 15000000 --text_field text --sequence_length 256
```

## Options

The options below are the ones related to the SwitchLoRA paper. The code contains additional options that are intended for testing purposes.

- `use_lora`: Whether to use LoRA adapters. If `switch_lora` is specified, `use_lora` is set to `true` automatically
- `switch_lora`: Whether to use SwitchLoRA
- `lora_rank`: LoRA rank
- `lr`: Learning rate
- `adam_warm_step`: How many steps to freeze LoRA vectors when their counterpart vectors are switched. Set to `5` by default
- `switch_lora_interval`: Initial value of the switch interval ($interval_0$)
- `switch_lora_descent_rate`: $ratio$ in the SwitchLoRA paper. It determines when the switching frequency has dropped to one-third of its initial value, which happens at step $total\_step \times ratio$
- `init_lora_type`: Can be `"origin_lora"` or `"switchlora"`. Use `"origin_lora"` to test the initialization method of vanilla LoRA
- `offload_candidates`: Whether to offload spare candidate vectors to CPU
- `continuous_switch`: Whether to switch contiguous indices of candidate vectors together, as described in the appendix of the SwitchLoRA paper
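
As a rough illustration of how `switch_lora_interval` and `switch_lora_descent_rate` interact, the sketch below assumes that the switching frequency decays exponentially with the training step, which is one schedule consistent with the one-third rule stated above. The function names and the value `interval_0 = 40` are hypothetical and are not taken from the repository code. The values `total_step = 40000` and `ratio = 0.1` mirror the example command later in this README.

```python
import math


def switch_frequency(step, total_step, interval_0, ratio):
    """Switches per step, ASSUMING exponential decay (not the repository's code).

    The decay constant theta is chosen so that the frequency equals one-third
    of its initial value (1 / interval_0) at step = total_step * ratio.
    """
    theta = math.log(3.0) / (total_step * ratio)
    return math.exp(-theta * step) / interval_0


def switch_interval(step, total_step, interval_0, ratio):
    """Average number of steps between switches: the inverse of the frequency."""
    return 1.0 / switch_frequency(step, total_step, interval_0, ratio)


# Illustrative values only; interval_0 = 40 is a hypothetical choice.
total_step, interval_0, ratio = 40000, 40, 0.1
for step in (0, 2000, 4000, 8000):
    interval = switch_interval(step, total_step, interval_0, ratio)
    print(f"step {step:>5}: about one switch every {interval:.1f} steps")
```

Under this assumption, a smaller `switch_lora_descent_rate` makes switching slow down earlier in training, while `switch_lora_interval` only sets the starting point.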

## Experiments

The scripts for experiments described in the SwitchLoRA paper are located in the directory [./llama/examples/](./llama/examples).

- **Basic experiments**

  Example scripts for the 350M LLaMA model are provided. The scripts can be found in [./llama/examples/basic](./llama/examples/basic).

  For the other model sizes detailed in the SwitchLoRA paper, change the following parameters in the scripts: `--model_config`, `--lr`, `--total_batch_size`, `--lora_rank`.

- **Comparison with other methods**

  We conduct experiments for [GaLore](https://github.com/jiaweizzhao/GaLore) [2] and [ReLoRA](https://github.com/Guitaricet/relora.git) [1].

  **For GaLore experiments**, clone the repository and install the required packages:

  ```shell
  git clone https://github.com/jiaweizzhao/GaLore.git
  cd GaLore
  pip install -r exp_requirements.txt
  ```

  Next, use the scripts located in `./llama/examples/galore` to run the experiments. New data will be downloaded because the data collator used by GaLore differs from ours.

  **For ReLoRA experiments**, clone the repository and set up the environment:

  ```shell
  cd path_to_this_repository
  git clone https://github.com/Guitaricet/relora.git
  cd relora
  pip install -e .
  pip install flash-attn
  ```

  Begin by running full pre-training with the script located at `./llama/examples/relora/run_full_250m.sh`. Then use `./llama/examples/relora/run_relora_250m.sh` to train with ReLoRA, starting from the full pre-training checkpoint saved at step 1,000. To train the same checkpoint with SwitchLoRA instead, use `./llama/examples/relora/run_switchlora_on_full250m.sh`.

- **Reasoning ability comparison**

  To fully fine-tune a model that was pre-trained with SwitchLoRA, first merge the LoRA adapters into the original model weights:
  
  ```shell
  cd path_to_this_repository
  torchrun --master_port 14202 --nproc-per-node 1 convert_checkpoint.py --model_config configs/llama_350m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_512 --batch_size 72 --total_batch_size 1152 --max_length 512 --num_training_steps 40000 --num_workers 8 --lora_rank 256 --lora_dropout 0.  --switch_lora --switch_lora_descent_rate 0.1 --zero_switch_step_state  --zero_switch_state --save_dir checkpoints/llama_350m_switchlora_512_batch1152_lr0.02_rate0.1_lora256_step40000 --autoresume True
  ```
  
  Then execute the following to run a specific GLUE task:
  
  ```shell
  # mrpc is the GLUE task name; 3e-5 is the learning rate
  bash ./llama/examples/glue/run_glue_full_switchlora_256.sh mrpc 3e-5
  ```
  
  For other experiments, adjust the `--model_name_or_path` parameter to different checkpoints as shown in `./llama/examples/glue/run_glue_full_full.sh`.

## References

[1] Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. ReLoRA: High-rank training through low-rank updates. In *Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023)*, 2023.

[2] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. *CoRR*, abs/2403.03507, 2024. doi: 10.48550/ARXIV.2403.03507.
