# Code for our ICLR 2024 paper, Model Pruning with Model Transfer

## Running Environment

This code runs our models on the CIFAR-100, CUB-200 and INaturalist-2018 datasets.
The code has been tested with Python 3.6.9, PyTorch 1.10.0 and CUDA 11.1.

## 1. Datasets
The CUB-200, CIFAR-100 and INaturalist-2018 datasets can be downloaded from their official websites. Among them, CUB-200 and INaturalist-2018 need to be preprocessed after downloading.
### 1.1 CUB-200
After downloading the CUB-200 dataset, extract the archive to obtain the following directory structure. (All images are stored in subfolders of the images folder, and each subfolder name is the category of the images it contains.)
```shell
CUB_200_2011
├── CUB_200_2011
│   ├── attributes
│   ├── images
│   │   ├── 001.Black_footed_Albatross
│   │   ├── 002.Laysan_Albatross
│   │   ├── ...
│   │   └── 200.Common_Yellowthroat
│   ├── parts
│   │   ├── datasets
│   │   ├── samplers
│   │   └── transforms
│   ├── bounding_boxes.txt
│   ├── classes.txt
│   ├── image_class_labels.txt
│   ├── images.txt
│   ├── README
│   └── train_test_split.txt
└── attributes.txt
```
After preprocessing, the files in the CUB_200_2011 folder are regrouped into the following directory structure. (The reorganization mainly takes place inside the images folder, which is split into train and test subfolders.)
```shell
CUB_200_2011
├── CUB_200_2011_new
│   └── images
│       ├── train
│       │   ├── 001.Black_footed_Albatross
│       │   ├── 002.Laysan_Albatross
│       │   ├── ...
│       │   └── 200.Common_Yellowthroat
│       └── test
│           ├── 001.Black_footed_Albatross
│           ├── 002.Laysan_Albatross
│           ├── ...
│           └── 200.Common_Yellowthroat
└── attributes.txt
```
When running experiments on the CUB-200 dataset, set the optional argument 'data_path' to 'root/CUB_200_2011/CUB_200_2011_new/', where 'root' is the directory that contains the CUB_200_2011 folder.
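
The reorganization into train and test folders can be sketched as follows. This is a minimal illustration, not necessarily the preprocessing the authors used: it assumes that images.txt holds `<image_id> <relative_path>` lines and that train_test_split.txt holds `<image_id> <is_training_image>` lines with 1 marking a training image. The function name `split_cub` is hypothetical.

```python
import shutil
from pathlib import Path


def split_cub(src_dir, dst_dir):
    """Copy CUB images into dst_dir/images/{train,test}/<class>/.

    src_dir: the extracted inner CUB_200_2011 folder (contains images.txt).
    dst_dir: the output folder, e.g. CUB_200_2011_new.
    Returns the number of images copied into each subset.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)

    # Assumed format of images.txt: "<image_id> <relative_path>" per line.
    id_to_path = {}
    for line in (src_dir / "images.txt").read_text().splitlines():
        if line.strip():
            image_id, rel_path = line.split()
            id_to_path[image_id] = rel_path

    # Assumed format of train_test_split.txt: "<image_id> <flag>", 1 = train.
    counts = {"train": 0, "test": 0}
    for line in (src_dir / "train_test_split.txt").read_text().splitlines():
        if not line.strip():
            continue
        image_id, is_train = line.split()
        subset = "train" if is_train == "1" else "test"
        rel_path = Path(id_to_path[image_id])  # <class folder>/<file name>
        target = dst_dir / "images" / subset / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_dir / "images" / rel_path, target)
        counts[subset] += 1
    return counts
```

A call such as `split_cub('root/CUB_200_2011/CUB_200_2011', 'root/CUB_200_2011/CUB_200_2011_new')` would then produce the layout shown above, with 'root' replaced by the actual parent directory.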
### 1.2 INaturalist-2018
After downloading the INaturalist-2018 dataset, extract the archive to obtain the following directory structure. (All images are stored in subfolders of the iNat2018_train_val folder, grouped by the category to which they belong.)
```shell
inaturalist2018
└── iNat2018_train_val
    ├── Actinopterygii
    │   └── ...
    ├── Amphibia
    │   └── ...
    ├── Animalia
    │   └── ...
    ├── ...
    └── Reptilia
        └── ...
```
Create the iNat2018_train.txt and iNat2018_val.txt files under the inaturalist2018 folder according to the official training annotations file and validation annotations file.
The following is an example of the contents of iNat2018_train.txt.
```shell
iNat2018_train_val/Plantae/7477/3b60c9486db1d2ee875f11a669fbde4a.jpg 7477
...
```
Here, 'iNat2018_train_val/Plantae/7477/3b60c9486db1d2ee875f11a669fbde4a.jpg' is the path of a training image and 7477 is the category to which it belongs.
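
One possible way to produce these list files is sketched below. It assumes the official annotation files are COCO-style JSON with an `images` list (`id`, `file_name`) and an `annotations` list (`image_id`, `category_id`); that layout, the helper names, and the optional prefix rewrite are assumptions for illustration rather than part of this repository.

```python
import json
from pathlib import Path


def annotation_lines(annotations, old_prefix=None, new_prefix=None):
    """Return "<image path> <category id>" lines from a COCO-style dict."""
    # Map each image id to its category id.
    category_of = {
        a["image_id"]: a["category_id"] for a in annotations["annotations"]
    }
    lines = []
    for image in annotations["images"]:
        path = image["file_name"]
        # Optionally rewrite the leading folder to match the local layout.
        if old_prefix is not None and path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        lines.append(f"{path} {category_of[image['id']]}")
    return lines


def write_list_file(json_path, txt_path, old_prefix=None, new_prefix=None):
    """Read an annotation JSON file and write the matching .txt list file."""
    with open(json_path) as f:
        annotations = json.load(f)
    lines = annotation_lines(annotations, old_prefix, new_prefix)
    Path(txt_path).write_text("\n".join(lines) + "\n")
```

Calling `write_list_file` once for the training annotations and once for the validation annotations would yield iNat2018_train.txt and iNat2018_val.txt. The prefix arguments are only needed if the folder name stored in the JSON differs from iNat2018_train_val.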

When running experiments on the INaturalist-2018 dataset, set the optional argument 'data_path' to 'root/inaturalist2018/', where 'root' is the directory that contains the inaturalist2018 folder.
In addition, modify data/INat2018.py as follows. Line 45: path = 'root/inaturalist2018/'; line 46: train_txt = "/root/inaturalist2018/iNat2018_train.txt"; line 47: test_txt = "/root/inaturalist2018/iNat2018_val.txt".

## 2. Pruning on CUB-200 and CIFAR-100 Datasets
The following are all examples of pruning experiments on the CUB-200 dataset. To run the experiments on the CIFAR-100 dataset, change "--data_set cub200" and "--data_path 'CUB200 DATASET DIR'" to "--data_set cifar100" and "--data_path 'CIFAR100 DATASET DIR'" respectively.

The examples cover different pipelines using the L1/Random/HRank/Network Slimming/EPruner/DepGraph pruning methods.


### 2.1 Example for ST (in page 5 line 211, the baseline for STP<sup>w</sup>T)
```shell
# For instance, prune ResNet-50 on CUB-200 dataset.
# --use_pretrain: inherit model weights pretrained on the ImageNet-1K dataset
# --transfer: transfer the model to the target dataset
python train.py \
    --data_set cub200 \
    --data_path 'CUB200 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --use_pretrain \
    --transfer \
    --finetune_epochs 120 \
    --lr_decay_step_finetune 50,90
```
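
To make the relation between `--finetune_epochs 120` and `--lr_decay_step_finetune 50,90` concrete, the sketch below computes a step-decayed learning rate per epoch. It assumes that the listed values are milestone epochs and that the rate is multiplied by 0.1 at each one; the decay factor is not stated in this README, so `gamma` and the function name `step_lr` are illustrative assumptions. The base rate 0.001 is the documented default of `--lr`.

```python
def step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate at `epoch` when it is scaled by `gamma` at each milestone."""
    # Count the milestones that have already been reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Parse the comma-separated value the same way it is written on the command line.
milestones = [int(m) for m in "50,90".split(",")]
for epoch in (0, 49, 50, 89, 90, 119):
    print(epoch, step_lr(0.001, epoch, milestones))
```

Under these assumptions the rate stays at its initial value for epochs 0 to 49, drops once at epoch 50 and once more at epoch 90.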

### 2.2 Example for STP<sup>w</sup>T (in page 5 line 220)
The command consists of two parts: the main part and the optional methods. The main part determines the pipeline of the experiment, and the optional methods determine which pruning method is used. Append the arguments of any one optional method to the main part to run the experiment.
#### Main Part
```shell
# For instance, prune ResNet-50 on CUB-200 dataset.
# --transfer: transfer the model to the target dataset
# --hard_inherit: finetune the pruned model with the inherited pretrained weights
# --resume_pretrain: the model_best.pt file obtained by the ST experiment
python train.py \
    --data_set cub200 \
    --data_path 'CUB200 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --transfer \
    --hard_inherit \
    --resume_pretrain 'MODEL_BEST.PT PATH' \
    --train_epochs 120 \
```
#### Optional Methods
```shell
# L1
# --compress_rate: layer-wise compress rate for L1
    --prune_rule l1_pretrain \
    --compress_rate 1.0*4+0.7*16 \
    --finetune_epochs 185 \
    --lr_decay_step_finetune 80,140
----------------------------------------------------------------------------------------
# Random
# --compress_rate: layer-wise compress rate for Random
    --prune_rule random_pretrain \
    --compress_rate 1.0*4+0.7*16 \
    --finetune_epochs 185 \
    --lr_decay_step_finetune 80,140
----------------------------------------------------------------------------------------
# DepGraph
# --target_flops_PR: pruning factor for DepGraph
    --prune_rule depgraph_pretrain \
    --target_flops_PR 0.46 \
    --finetune_epochs 225 \
    --lr_decay_step_finetune 95,170
----------------------------------------------------------------------------------------
# HRank
# --compress_rate: layer-wise compress rate for HRank
# Tip: HRank requires rank generation before pruning. See the HRank implementation for details on rank generation.
    --prune_rule hrank_pretrain \
    --compress_rate 1.0*4+0.7*16 \
    --rank_path [Prefix directory of rank folder for HRank method] \
    --finetune_epochs 185 \
    --lr_decay_step_finetune 80,140
----------------------------------------------------------------------------------------
# Network Slimming
# --channel_PR: pruning rate for Network Slimming
    --prune_rule NS_pretrain \
    --channel_PR 0.5 \
    --finetune_epochs 315 \
    --lr_decay_step_finetune 130,235
----------------------------------------------------------------------------------------
# EPruner
# --preference_beta: pruning factor for EPruner
# --init_method: how the model pruned by EPruner inherits weights
    --prune_rule epruner_pretrain \
    --preference_beta 0.78 \
    --init_method centroids \
    --finetune_epochs 335 \
    --lr_decay_step_finetune 150,270
```
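
As a rough illustration of how the main part and one optional method fit together, the sketch below assembles the STP<sup>w</sup>T command as an argument list. The names `MAIN_PART`, `OPTIONAL_METHODS`, `build_command` and `as_shell` are hypothetical helpers, not part of this repository, and only two of the six methods are filled in.

```python
import shlex

# Main part of the STPwT pipeline, with the placeholders from the example.
MAIN_PART = [
    "python", "train.py",
    "--data_set", "cub200",
    "--data_path", "CUB200 DATASET DIR",
    "--job_dir", "SAVE PATH",
    "--gpus", "0,1",
    "--transfer",
    "--hard_inherit",
    "--resume_pretrain", "MODEL_BEST.PT PATH",
    "--train_epochs", "120",
]

# Arguments of two optional methods, copied from the examples.
OPTIONAL_METHODS = {
    "l1": [
        "--prune_rule", "l1_pretrain",
        "--compress_rate", "1.0*4+0.7*16",
        "--finetune_epochs", "185",
        "--lr_decay_step_finetune", "80,140",
    ],
    "depgraph": [
        "--prune_rule", "depgraph_pretrain",
        "--target_flops_PR", "0.46",
        "--finetune_epochs", "225",
        "--lr_decay_step_finetune", "95,170",
    ],
}


def build_command(method):
    """Return the main part followed by the arguments of one optional method."""
    return MAIN_PART + OPTIONAL_METHODS[method]


def as_shell(args):
    """Render an argument list as a single, safely quoted shell command."""
    return " ".join(shlex.quote(a) for a in args)


print(as_shell(build_command("l1")))
```

Passing the list to `subprocess.run(build_command("l1"), check=True)` avoids shell parsing altogether. When the command is typed by hand, quoting the `--compress_rate` value as in the printed output keeps the shell from treating `*` as a glob pattern.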

### 2.3 Example for SP<sup>r</sup>T (in page 5 line 214-215)
```shell
# --compress_rate: layer-wise compress rate for generating a slim model
# --train_slim: train a slim model from scratch
python train.py \
    --data_set cub200 \
    --data_path 'CUB200 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --compress_rate 1.0*4+0.7*16 \
    --train_slim \
    --finetune_epochs 350 \
    --lr_decay_step_finetune 250,290,320
```


### 2.4 Example for SP<sup>w</sup>T (in page 5 line 214-215)
The command consists of two parts: the main part and the optional methods. The main part determines the pipeline of the experiment, and the optional methods determine which pruning method is used. Append the arguments of any one optional method to the main part to run the experiment.
#### Main Part
```shell
# For instance, prune ResNet-50 on CUB-200 dataset.
# --use_pretrain: inherit model weights pretrained on the ImageNet-1K dataset
# --hard_inherit: finetune the pruned model with the inherited pretrained weights
python train.py \
    --data_set cub200 \
    --data_path 'CUB200 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --use_pretrain \
    --hard_inherit \
```
#### Optional Methods
```shell
# L1
# --compress_rate: layer-wise compress rate for L1
    --prune_rule l1_pretrain \
    --compress_rate 1.0*4+0.6*16 \
    --finetune_epochs 225 \
    --lr_decay_step_finetune 95,170
----------------------------------------------------------------------------------------
# Random
# --compress_rate: layer-wise compress rate for Random
    --prune_rule random_pretrain \
    --compress_rate 1.0*4+0.6*16 \
    --finetune_epochs 225 \
    --lr_decay_step_finetune 95,170
----------------------------------------------------------------------------------------
# DepGraph
# --target_flops_PR: pruning factor for DepGraph
    --prune_rule depgraph_pretrain \
    --target_flops_PR 0.55 \
    --finetune_epochs 265 \
    --lr_decay_step_finetune 115,200
----------------------------------------------------------------------------------------
# HRank
# --compress_rate: layer-wise compress rate for HRank
# Tip: HRank requires rank generation before pruning. See the HRank implementation for details on rank generation.
    --prune_rule hrank_pretrain \
    --compress_rate 1.0*4+0.6*16 \
    --rank_path [Prefix directory of rank folder for HRank method] \
    --finetune_epochs 225 \
    --lr_decay_step_finetune 95,170
----------------------------------------------------------------------------------------
# Network Slimming
# --channel_PR: pruning rate for Network Slimming
    --prune_rule NS_pretrain \
    --channel_PR 0.6 \
    --finetune_epochs 385 \
    --lr_decay_step_finetune 160,290
----------------------------------------------------------------------------------------
# EPruner
# --preference_beta: pruning factor for EPruner
# --init_method: how the model pruned by EPruner inherits weights
    --prune_rule epruner_pretrain \
    --preference_beta 0.70 \
    --init_method centroids \
    --finetune_epochs 255 \
    --lr_decay_step_finetune 105,190
```

### 2.5 Example for TP<sup>w</sup>T (in page 5 line 222)
After running the TP<sup>w</sup>T experiment once, we obtain the full-size model trained on the target dataset. To control variables, subsequent TP<sup>w</sup>T experiments use '--resume_pretrain [model_best.pt obtained by the first TP<sup>w</sup>T experiment]' to load the same full-size model weights trained on the target dataset.

The command consists of two parts: the main part and the optional methods. The main part determines the pipeline of the experiment, and the optional methods determine which pruning method is used. Append the arguments of any one optional method to the main part to run the experiment.
#### Main Part
```shell
# For instance, prune ResNet-50 on CUB-200 dataset.
# --hard_inherit: finetune the pruned model with the inherited pretrained weights
# --resume_pretrain (optional): model_best.pt obtained by the first TP^wT experiment;
#   omit this argument in the first TP^wT experiment
python train.py \
    --data_set cub200 \
    --data_path 'CUB200 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --hard_inherit \
    --lr 0.01 \
    --resume_pretrain 'MODEL_BEST.PT PATH' \
```
#### Optional Methods
```shell
# L1
# --compress_rate: layer-wise compress rate for L1
    --prune_rule l1_pretrain \
    --compress_rate 1.0*4+0.4*16 \
    --finetune_epochs 330 \
    --lr_decay_step_finetune 140,250
----------------------------------------------------------------------------------------
# Random
# --compress_rate: layer-wise compress rate for Random
    --prune_rule random_pretrain \
    --compress_rate 1.0*4+0.4*16 \
    --finetune_epochs 330 \
    --lr_decay_step_finetune 140,250
----------------------------------------------------------------------------------------
# DepGraph
# --target_flops_PR: pruning factor for DepGraph
    --prune_rule depgraph_pretrain \
    --target_flops_PR 0.64 \
    --finetune_epochs 335 \
    --lr_decay_step_finetune 140,250
----------------------------------------------------------------------------------------
# HRank
# --compress_rate: layer-wise compress rate for HRank
# Tip: HRank requires rank generation before pruning. See the HRank implementation for details on rank generation.
    --prune_rule hrank_pretrain \
    --compress_rate 1.0*4+0.4*16 \
    --rank_path [Prefix directory of rank folder for HRank method] \
    --finetune_epochs 330 \
    --lr_decay_step_finetune 140,250
----------------------------------------------------------------------------------------
# Network Slimming
# --channel_PR: pruning rate for Network Slimming
    --prune_rule NS_pretrain \
    --channel_PR 0.3 \
    --finetune_epochs 215 \
    --lr_decay_step_finetune 90,160
----------------------------------------------------------------------------------------
# EPruner
# --preference_beta: pruning factor for EPruner
# --init_method: how the model pruned by EPruner inherits weights
    --prune_rule epruner_pretrain \
    --preference_beta 0.50 \
    --init_method centroids \
    --finetune_epochs 150 \
    --lr_decay_step_finetune 65,115
```

### 2.6 Optional Arguments

```shell
python train.py \
    --data_set              Dataset name. Optional: cub200, cifar100. default: cub200
    --data_path             Dataset directory.
    --job_dir               The directory where the summaries will be stored.
    --gpus                  Limit the list of available GPU devices that the program can see. default: 0,1
    --manualSeed            Manual seed.
    --cfg                   Architecture of model. default: resnet50
    --prune_rule            Select pruning method to experiment. 
                            Optional: l1_pretrain, random_pretrain, hrank_pretrain, NS_pretrain, depgraph_pretrain, epruner_pretrain. 
                            default: l1_pretrain
    --compress_rate         Compress rate of each convolutional layer for l1_pretrain/random_pretrain/hrank_pretrain pruning methods. 
                            default: 1.0*100
    --rank_path             Prefix directory of rank folder for HRank method.
    --channel_PR            Channel pruning rate for NS_pretrain (Network Slimming) pruning method.
    --target_flops_PR       Channel pruning rate for depgraph_pretrain (DepGraph) pruning method.
    --preference_beta       Channel pruning factor for epruner_pretrain (EPruner) pruning method.
    --init_method           Initialization method of the pruned model for epruner_pretrain (EPruner). 
                            Optional: random, centroids, random_project. default: centroids
    --use_pretrain          Whether to inherit model weights pretrained on the ImageNet-1K dataset.
    --transfer              Whether the model is transferred to the target dataset.
    --hard_inherit          Whether the pruned model is finetuned with the inherited pretrained weights.
    --train_slim            Whether the model is structured by the preset pruning rate.
    --train_batch_size      Batch size for training. default: 32
    --eval_batch_size       Batch size for validation. default: 32
    --train_epochs          The number of epochs to pretrain slim model from scratch on target dataset. 
                            default: 350
    --finetune_epochs       The number of epochs to finetune pruned model on target dataset. default: 200
    --momentum              Momentum for optimizer. default: 0.9
    --lr_train              Learning rate for pretraining. default: 0.01
    --lr                    Initial learning rate for finetuning. default: 0.001
    --lr_type               Learning rate decay schedule. default: step. optional: step, cos
    --lr_decay_step_pretrain The interval of learning rate decay for pretraining. default: 250,290,320
    --lr_decay_step_finetune The interval of learning rate decay for finetuning. default: 80,120,160
    --train_weight_decay    The weight decay of loss function for pretraining. default: 5e-4
    --weight_decay          The weight decay of loss function for finetuning. default: 5e-4
    --resume_pretrain       Pretrained checkpoint path.
    --resume_finetune       Finetuned checkpoint path.
```
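
The `--compress_rate` strings used in the examples, such as `1.0*4+0.7*16`, read as `rate*count` terms joined by `+`, giving one rate per layer. The sketch below expands such a string under that reading. The actual parser in train.py is not shown in this README and may differ, and the function name `expand_compress_rate` is hypothetical.

```python
def expand_compress_rate(spec):
    """Expand e.g. "1.0*4+0.7*16" into a flat list with one rate per layer."""
    rates = []
    for term in spec.split("+"):
        # Each term is "<rate>*<count>"; a term without "*" counts as one layer.
        rate, _, count = term.partition("*")
        rates.extend([float(rate)] * (int(count) if count else 1))
    return rates


rates = expand_compress_rate("1.0*4+0.7*16")
print(len(rates), rates[:5])
```

Under this reading, the default `1.0*100` expands to one hundred entries of 1.0, and `1.0*4+0.7*16` gives four layers at 1.0 followed by sixteen layers at 0.7.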



## 3. Pruning on INaturalist-2018 Dataset
The examples cover different pipelines using the L1/Random/HRank/Network Slimming/EPruner/DepGraph pruning methods. To use the Random or HRank pruning method, change '--prune_rule l1_pretrain' to '--prune_rule random_pretrain' or '--prune_rule hrank_pretrain' in the pipeline with the L1 pruning method. To use the DepGraph pruning method, change '--prune_rule l1_pretrain' to '--prune_rule depgraph_pretrain', remove '--compress_rate' and add '--target_flops_PR'. To use the Network Slimming pruning method, change '--prune_rule l1_pretrain' to '--prune_rule NS_pretrain', remove '--compress_rate' and add '--channel_PR'.
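
The flag substitutions described above can be expressed as a small helper that rewrites an L1 argument list for another method. This is only a sketch: `RATE_FLAG` and `switch_method` are hypothetical names, HRank's additional `--rank_path` argument is not handled, and EPruner is left out because the paragraph above does not describe it.

```python
# Rate flag used by each pruning rule, following the paragraph above.
RATE_FLAG = {
    "l1_pretrain": "--compress_rate",
    "random_pretrain": "--compress_rate",
    "hrank_pretrain": "--compress_rate",
    "depgraph_pretrain": "--target_flops_PR",
    "NS_pretrain": "--channel_PR",
}


def switch_method(args, new_rule, rate):
    """Return a copy of an L1 argument list adapted to `new_rule`."""
    out = []
    skip_value = False
    for arg in args:
        if skip_value:
            # Drop the value that followed a removed flag.
            skip_value = False
            continue
        if arg in ("--prune_rule", "--compress_rate"):
            skip_value = True
            continue
        out.append(arg)
    # Append the new rule together with the rate flag that belongs to it.
    return out + ["--prune_rule", new_rule, RATE_FLAG[new_rule], rate]


l1_args = ["--cfg", "resnet50", "--prune_rule", "l1_pretrain",
           "--compress_rate", "1.0*4+0.6*16", "--finetune_epochs", "200"]
print(switch_method(l1_args, "NS_pretrain", "0.66"))
```

The remaining arguments, such as `--cfg` and `--finetune_epochs`, are carried over unchanged, so epoch counts still need to be adjusted by hand where the examples use different values per method.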

### 3.1 Example for ST (in page 5 line 211)
```shell
python train_DDP.py \
    --data_set inaturalist2018 \
    --data_path 'INATURALIST2018 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --cfg resnet50 \
    --use_pretrain \
    --transfer \
    --train_batch_size 512 \
    --eval_batch_size 512 \
    --finetune_epochs 90 \
    --momentum 0.9 \
    --lr 0.02 \
    --lr_type cos \
    --weight_decay 1e-4 \
    --dist_url 'tcp://localhost:50001' \
    --multiprocessing_distributed \
    --world_size 1 \
    --local_rank 0
```

### 3.2 Example for STP<sup>w</sup>T (in page 5 line 220)
The command consists of two parts: the main part and the optional methods. The main part determines the pipeline of the experiment, and the optional methods determine which pruning method is used. Append the arguments of any one optional method to the main part to run the experiment.
#### Main Part
```shell
# --hard_inherit: finetune the pruned model with the inherited pretrained weights
# --resume_pretrain: the model_best.pt file obtained by the ST experiment
python train_DDP.py \
    --data_set inaturalist2018 \
    --data_path 'INATURALIST2018 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --cfg resnet50 \
    --transfer \
    --hard_inherit \
    --train_batch_size 512 \
    --eval_batch_size 512 \
    --momentum 0.9 \
    --lr 0.02 \
    --lr_type cos \
    --weight_decay 1e-4 \
    --dist_url 'tcp://localhost:50001' \
    --multiprocessing_distributed \
    --world_size 1 \
    --local_rank 0 \
    --train_epochs 90 \
    --resume_pretrain 'MODEL_BEST.PT PATH' \
```
#### Optional Methods
```shell
# L1
# --compress_rate: layer-wise compress rate for L1
    --prune_rule l1_pretrain \
    --compress_rate 1.0*4+0.7*16 \
    --finetune_epochs 140
----------------------------------------------------------------------------------------
# Random
# --compress_rate: layer-wise compress rate for Random
    --prune_rule random_pretrain \
    --compress_rate 1.0*4+0.7*16 \
    --finetune_epochs 140
----------------------------------------------------------------------------------------
# DepGraph
# --target_flops_PR: pruning factor for DepGraph
    --prune_rule depgraph_pretrain \
    --target_flops_PR 0.46 \
    --finetune_epochs 165
----------------------------------------------------------------------------------------
# HRank
# --compress_rate: layer-wise compress rate for HRank
# Tip: HRank requires rank generation before pruning. See the HRank implementation for details on rank generation.
    --prune_rule hrank_pretrain \
    --compress_rate 1.0*4+0.7*16 \
    --rank_path [Prefix directory of rank folder for HRank method] \
    --finetune_epochs 140
----------------------------------------------------------------------------------------
# Network Slimming
# --channel_PR: pruning rate for Network Slimming
    --prune_rule NS_pretrain \
    --channel_PR 0.45 \
    --finetune_epochs 195
----------------------------------------------------------------------------------------
# EPruner
# --preference_beta: pruning factor for EPruner
# --init_method: how the model pruned by EPruner inherits weights
    --prune_rule epruner_pretrain \
    --preference_beta 0.78 \
    --init_method centroids \
    --finetune_epochs 200
```

### 3.3 Example for SP<sup>w</sup>T (in page 5 line 214-215)
The command consists of two parts: the main part and the optional methods. The main part determines the pipeline of the experiment, and the optional methods determine which pruning method is used. Append the arguments of any one optional method to the main part to run the experiment.
#### Main Part
```shell
# --hard_inherit: finetune the pruned model with the inherited pretrained weights
python train_DDP.py \
    --data_set inaturalist2018 \
    --data_path 'INATURALIST2018 DATASET DIR' \
    --job_dir 'SAVE PATH' \
    --gpus 0,1 \
    --cfg resnet50 \
    --use_pretrain \
    --hard_inherit \
    --train_batch_size 512 \
    --eval_batch_size 512 \
    --finetune_epochs 200 \
    --momentum 0.9 \
    --lr 0.02 \
    --lr_type cos \
    --weight_decay 1e-4 \
    --dist_url 'tcp://localhost:50001' \
    --multiprocessing_distributed \
    --world_size 1 \
    --local_rank 0 \
```
#### Optional Methods
```shell
# L1
# --compress_rate: layer-wise compress rate for L1
    --prune_rule l1_pretrain \
    --compress_rate 1.0*4+0.6*16 \
    --finetune_epochs 200
----------------------------------------------------------------------------------------
# Random
# --compress_rate: layer-wise compress rate for Random
    --prune_rule random_pretrain \
    --compress_rate 1.0*4+0.6*16 \
    --finetune_epochs 200
----------------------------------------------------------------------------------------
# DepGraph
# --target_flops_PR: pruning factor for DepGraph
    --prune_rule depgraph_pretrain \
    --target_flops_PR 0.55 \
    --finetune_epochs 200
----------------------------------------------------------------------------------------
# HRank
# --compress_rate: layer-wise compress rate for HRank
# Tip: HRank requires rank generation before pruning. See the HRank implementation for details on rank generation.
    --prune_rule hrank_pretrain \
    --compress_rate 1.0*4+0.6*16 \
    --rank_path [Prefix directory of rank folder for HRank method] \
    --finetune_epochs 200
----------------------------------------------------------------------------------------
# Network Slimming
# --channel_PR: pruning rate for Network Slimming
    --prune_rule NS_pretrain \
    --channel_PR 0.66 \
    --finetune_epochs 200
----------------------------------------------------------------------------------------
# EPruner
# --preference_beta: pruning factor for EPruner
# --init_method: how the model pruned by EPruner inherits weights
    --prune_rule epruner_pretrain \
    --preference_beta 0.70 \
    --init_method centroids \
    --finetune_epochs 190
```


### 3.4 Optional Arguments

```shell
python train_DDP.py \
    --data_set                      Dataset name. Optional: cub200, cifar100. default: cub200
    --data_path                     Dataset directory.
    --job_dir                       The directory where the summaries will be stored.
    --gpus                          Limit the list of available GPU devices that the program can see. default: 0,1
    --manualSeed                    Manual seed.
    --cfg                           Architecture of model. default: resnet50
    --prune_rule                    Select pruning method to experiment. 
                                    Optional: l1_pretrain, random_pretrain, hrank_pretrain, NS_pretrain, depgraph_pretrain, epruner_pretrain. 
                                    default: l1_pretrain
    --compress_rate                 Compress rate of each convolutional layer for l1_pretrain/random_pretrain/hrank_pretrain pruning methods. 
                                    default: 1.0*100
    --rank_path                     Prefix directory of rank folder for HRank method.
    --channel_PR                    Channel pruning rate for NS_pretrain (Network Slimming) pruning method.
    --target_flops_PR               Channel pruning rate for depgraph_pretrain (DepGraph) pruning method.
    --preference_beta               Channel pruning factor for epruner_pretrain (EPruner) pruning method.
    --init_method                   Initialization method of the pruned model for epruner_pretrain (EPruner). 
                                    Optional: random, centroids, random_project. default: centroids
    --use_pretrain                  Whether to inherit model weights pretrained on the ImageNet-1K dataset.
    --transfer                      Whether the model is transferred to the target dataset.
    --hard_inherit                  Whether the pruned model is finetuned with the inherited pretrained weights.
    --train_slim                    Whether the model is structured by the preset pruning rate.
    --train_batch_size              Batch size for training. default: 32
    --eval_batch_size               Batch size for validation. default: 32
    --finetune_epochs               The number of epochs to finetune pruned model on target dataset. default: 200
    --momentum                      Momentum for optimizer. default: 0.9
    --lr                            Initial learning rate for finetuning. default: 0.001
    --lr_type                       Learning rate decay schedule. default: step. optional: step, cos
    --lr_decay_step_finetune        The interval of learning rate decay for finetuning. default: 80,120,160
    --weight_decay                  The weight decay of loss function for finetuning. default: 5e-4
    --resume_pretrain               Pretrained checkpoint path.
    --resume_finetune               Finetuned checkpoint path.
    --workers                       Number of data loading workers. default: 32
    --world_size                    Number of nodes for distributed training. default: 1
    --local_rank                    Node rank for distributed training. default: 0
    --dist_url                      Url used to set up distributed training. default: 'tcp://localhost:50001'
    --dist_backend                  Distributed backend. default: 'nccl'
    --multiprocessing_distributed   Use multi-processing distributed training to launch N processes per node, which has N GPUs. 
                                    This is the fastest way to use PyTorch for either single node or multi node data parallel training.
    --gpu                           GPU id to use. default: None
```


