# Experiments record

## 130m with 2968248 data
### experiment 1
llama (commit: 101a54e09)
switchlora (commit: 8594c654)

- larger warm-up step count, large `adam_warm_step` (local)
```bash
export CUDA_VISIBLE_DEVICES=0,1 
torchrun --nproc-per-node 2 main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced --batch_size 300 --total_batch_size 600 --lr 6e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.3 --switch_lora_interval 40 --lora_rank 128 --save_dir=checkpoints/large_warm_step --adam_warm_step 10  
```
step=4000 -> training loss = 4.51
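
The commands in this record pass both `--batch_size` and `--total_batch_size`. A minimal sketch of how the two could relate, assuming the common convention `total_batch_size = batch_size * world_size * gradient_accumulation` (this is an assumption and has not been checked against `main.py`; the helper name `grad_accum_steps` is hypothetical):

```python
def grad_accum_steps(total_batch_size: int, batch_size: int, world_size: int) -> int:
    """Micro-batches accumulated per optimizer update (assumed convention)."""
    per_step = batch_size * world_size
    if total_batch_size % per_step != 0:
        raise ValueError("total_batch_size must be a multiple of batch_size * world_size")
    return total_batch_size // per_step


# torchrun with 2 processes, batch_size=300, total_batch_size=600
two_gpu = grad_accum_steps(600, 300, 2)
# a single-process `python main.py` run with the same batch flags
one_gpu = grad_accum_steps(600, 300, 1)
print(two_gpu, one_gpu)
```

Under that assumption, the two-process run updates after every micro-batch, while a single-process run with the same flags accumulates two micro-batches per update.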

- no drop rate 
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 6e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.3 --switch_lora_interval 10 --lora_rank 128 --switch_lora_drop 0  
```
Tested with normal results (slightly better than the switch_lora_drop=0.5 case)

- large drop rate
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 6e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.3 --switch_lora_interval 10 --lora_rank 128 --switch_lora_drop 0.9
```
Tested with bad results 

- negative drop rate
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 6e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.3 --switch_lora_interval 10 --lora_rank 128 --switch_lora_drop -1
```
Similar to the no drop rate case

- large negative drop rate
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 6e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.3 --switch_lora_interval 10 --lora_rank 128 --switch_lora_drop -10 --save_dir checkpoints/llama_130m_256_switchlora_dropminus10  
```
Tested with normal results

- 4e-3 learning rate
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.3 --switch_lora_interval 10 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_lr0.004
```
Tested with normal results
step=4500 -> training loss = 4.5177

- smaller descent rate (`switch_lora_descent_rate`)
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.05 --switch_lora_interval 10 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_des0.05
```
Tested with good results
step=2500 -> training loss = 4.78

- enable candidates drop
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.05 --switch_lora_interval 20 --adam_warm_step 10 --lora_rank 128 --drop_switch_lora_candidates --save_dir checkpoints/llama_130m_256_switchlora_drop
```
Tested with good results
step=4947 -> test loss = 3.85
step=4500 -> training loss = 4.38


- fixed interval, enable candidates drop
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 300 --total_batch_size 600 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 250 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.05 --switch_lora_interval 20 --adam_warm_step 10 --lora_rank 128 --drop_switch_lora_candidates --save_dir checkpoints/llama_130m_256_switchlora_drop
```
Tested with very bad results

- small batch size test
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 1e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --save_dir checkpoints/llama_130m_256_batch128 
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 5e-4 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --save_dir checkpoints/llama_130m_256_batch128lr0.0005 
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 2e-4 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --save_dir checkpoints/llama_130m_256_batch128lr0.0002   
```
step=20000 for all three tests
lr=1e-3 -> training loss=4.303, test loss=3.96
lr=5e-4 -> training loss=4.025, test loss=3.71
lr=2e-4 -> training loss=4.268, test loss=3.97
Hence, lr=5e-4 is chosen for batch size=128 (lowest training and test loss of the three)
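
The learning-rate choice above can be reproduced from the three logged results. A small sketch; the tuple layout and variable names are illustrative only and are not part of the training code:

```python
# (lr, training loss, test loss) at step 20000, copied from the results above
baseline_runs = [
    (1e-3, 4.303, 3.96),
    (5e-4, 4.025, 3.71),
    (2e-4, 4.268, 3.97),
]

# Select the run with the lowest test loss
best_lr, best_train, best_test = min(baseline_runs, key=lambda run: run[2])
print(f"best lr={best_lr:g} (train={best_train}, test={best_test})")
```

The middle learning rate wins on both metrics, so the optimum for this batch size appears to lie between 2e-4 and 1e-3; only these three values were tested.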

- small batch size switch lora test
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.02 --switch_lora_interval 10 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128 
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 1e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.02 --switch_lora_interval 10 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128_lr0.001 
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 6e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.02 --switch_lora_interval 10 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128_lr0.006   
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.02 --switch_lora_interval 2 --adam_warm_step 1 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128warm1
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.01 --switch_lora_interval 2 --adam_warm_step 1 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128warm1rate0.01
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.04 --switch_lora_interval 2 --adam_warm_step 1 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128warm1rate0.04
```
step=20000 for all tests
lr=4e-3 -> training loss=4.2086, test loss=3.878
lr=1e-3 -> training loss=4.4768, test loss=4.13
lr=6e-3 -> training loss=4.2242, test loss=3.89
lr=4e-3, warm=1, switch_lora_descent_rate=0.02 -> training loss=4.201, test loss=3.854
lr=4e-3, warm=1, switch_lora_descent_rate=0.04 -> training loss=4.230, test loss=3.874
lr=4e-3, warm=1, switch_lora_descent_rate=0.01 -> training loss=4.189, test loss=3.8485
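
The three runs with `--switch_lora_interval 2 --adam_warm_step 1` differ only in `--switch_lora_descent_rate` (0.01, 0.02 and 0.04, per the commands above). A short sketch that checks whether test loss rises with the descent rate over that range; the dictionary is just a transcription of the logged numbers:

```python
# switch_lora_descent_rate -> test loss at step 20000
# (lr=4e-3, switch_lora_interval=2, adam_warm_step=1)
test_loss_by_rate = {0.01: 3.8485, 0.02: 3.854, 0.04: 3.874}

rates = sorted(test_loss_by_rate)
losses = [test_loss_by_rate[r] for r in rates]

# True if every increase in descent rate also increased the test loss
monotonic = all(a < b for a, b in zip(losses, losses[1:]))
best_rate = rates[losses.index(min(losses))]
print(f"monotonic={monotonic}, best descent rate={best_rate}")
```

With a single run per setting, the differences (about 0.03 between the extremes) may be within run-to-run noise, so this is a trend to confirm rather than a conclusion.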

- small batch size lora test
```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --use_lora --lora_rank 128 --save_dir checkpoints/llama_130m_256_lora_batch128 
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 1e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --use_lora --lora_rank 128 --save_dir checkpoints/llama_130m_256_lora_batch128_lr0.001
```
step=20000 for both tests
lr=4e-3 -> training loss=4.24, test loss=3.9
lr=1e-3 -> training loss=4.575, test loss=4.17
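
To put the batch size 128 results side by side, the sketch below takes the best logged test loss for each method in experiment 1 (full-rank, switch LoRA, LoRA) and reports the gap to full-rank training. The perplexity column assumes the logged loss is a mean per-token cross-entropy in nats, which this record does not state.

```python
import math

# Best test loss at step 20000 for each method (batch size 128), from this record
best_test_loss = {
    "full-rank": 3.71,      # lr=5e-4
    "switch_lora": 3.8485,  # lr=4e-3, warm=1, switch_lora_descent_rate=0.01
    "lora": 3.9,            # lr=4e-3
}

baseline = best_test_loss["full-rank"]
ranking = sorted(best_test_loss, key=best_test_loss.get)
gaps = {name: best_test_loss[name] - baseline for name in ranking}

for name in ranking:
    # Perplexity is only meaningful under the assumption stated above
    ppl = math.exp(best_test_loss[name])
    print(f"{name:12s} test loss={best_test_loss[name]:.4f} "
          f"gap={gaps[name]:+.4f} ppl~{ppl:.1f}")
```

Each method is represented by its best tested configuration, and the methods were not tuned over the same grid, so the gaps are indicative only.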

### experiment 2

```bash
# warm 0 test(adam_warm_step=-1)
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.01 --switch_lora_interval 1 --adam_warm_step -1 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128warm0
# zero init B
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.01 --switch_lora_interval 2 --adam_warm_step 1 --lora_rank 128 --save_dir checkpoints/llama_130m_256_switchlora_batch128fixB
# zero init B no drop
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.01 --switch_lora_interval 2 --adam_warm_step 1 --zero_init_B --lora_rank 128 --switch_lora_drop 0 --save_dir checkpoints/llama_130m_256_switchlora_batch128fixB_drop0
# zero init B keep candidates
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.01 --switch_lora_interval 2 --adam_warm_step 1 --zero_init_B --lora_rank 128 --switch_lora_drop -100 --save_dir checkpoints/llama_130m_256_switchlora_batch128fixB_drop100
```
step=20000 for all tests
adam_warm_step=-1 -> training loss=, test loss=3.8905
zero init B -> training loss=, test loss=3.8486
zero init B, drop=0 -> training loss=, test loss=3.8484
zero init B, drop=-100 -> training loss=, test loss=


```bash
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.001 --lora_rank 128 --switch_lora_interval 2 --adam_warm_step 1 --switch_lora_drop 0 --zero_init_B --save_dir checkpoints/llama_130m_256_switchlora_batch128descentRate0.001  # not saved
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.1 --lora_rank 128 --switch_lora_interval 2 --adam_warm_step 1 --switch_lora_drop 0 --zero_init_B  --switch_descend_type Z --save_dir checkpoints/llama_130m_256_switchlora_batch128Zrate0.1
python main.py --model_config configs/llama_130m.json --dataset_path preprocessed_data/allenai/c4_en_t5-base_256_reduced_$GPU_NUM --batch_size 128 --total_batch_size 128 --lr 4e-3 --max_length 256 --num_training_steps 20000 --save_every 500 --eval_every 500 --keep_checkpoints 3 --num_workers 8 --switch_lora --switch_lora_descent_rate 0.1 --lora_rank 128 --switch_lora_interval 2 --adam_warm_step 1 --switch_lora_drop 0 --zero_init_B  --switch_descend_type Z --save_dir checkpoints/llama_130m_256_switchlora_batch128Zrate0.05
```
Z type rate 0.01 -> training loss = 4.1702, test loss = 3.8314 (best switch LoRA result so far)
descent rate=0.001 -> training loss = , test loss = 3.827
