# Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice

## Abstract

Learning rate schedulers have shown great success in speeding up the convergence of learning algorithms in practice. However, their convergence to a minimum has not been proven theoretically. This difficulty mainly arises from the fact that, while traditional convergence analysis prescribes monotonically decreasing (or constant) learning rates, schedulers opt for rates that often increase and decrease through the training epochs. In this work, we aim to bridge the gap by proposing a probabilistic learning rate scheduler (PLRS) that does not conform to the monotonically decreasing condition, with provable convergence guarantees. To cement the relevance and utility of our work in modern-day applications, we show experimental results on deep neural network architectures such as ResNet, WRN, VGG, and DenseNet on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets. We show that PLRS performs as well as or better than existing state-of-the-art learning rate schedulers in terms of convergence as well as accuracy. For example, while training ResNet-110 on the CIFAR-100 dataset, we outperform the state-of-the-art knee scheduler by 1.56% in terms of classification accuracy. Furthermore, on the Tiny ImageNet dataset using the ResNet-50 architecture, we show a significantly more stable convergence than the cosine scheduler and a better classification accuracy than the existing schedulers.


### Results Summary

### CIFAR-100

| Model          | Learning Rate Schedule | Training Accuracy (%) | Test Accuracy (%) | Accuracy Drop (%) |
|----------------|------------------------|-----------------------|-------------------|-------------------|
| ResNet-110     | Cosine                 | 74.22                 | 72.66             | 1.56              |
| ResNet-110     | Knee                   | 75.78                 | 72.39             | 2.96              |
| ResNet-110     | One-cycle              | 71.09                 | 70.05             | 1.19              |
| ResNet-110     | Constant               | 69.53                 | 66.67             | 2.51              |
| ResNet-110     | Multi-step             | 63.28                 | 61.20             | 2.39              |
| ResNet-110     | PLRS (ours)            | **77.34**             | **74.61**         | 2.95              |
| DenseNet-40-12 | Cosine                 | 82.81                 | 80.47             | 2.07              |
| DenseNet-40-12 | Knee                   | 82.81                 | 80.73             | 2.39              |
| DenseNet-40-12 | One-cycle              | 73.44                 | 72.39             | 0.90              |
| DenseNet-40-12 | Constant               | 82.81                 | 80.73             | 2.39              |
| DenseNet-40-12 | Multi-step             | **87.50**             | **84.89**         | 2.39              |
| DenseNet-40-12 | PLRS (ours)            | 84.37                 | 83.33             | 0.90              |

### CIFAR-10

| **Architecture** | **Scheduler** | **Max Test acc.** | **Mean test acc. (S.D)** |
|------------------|---------------|-------------------|-------------------------|
| VGG-16           | Cosine        | 96.87             | 96.09 (0.78)            |
| VGG-16           | Knee          | 96.87             | **96.35** (0.45)        |
| VGG-16           | One-cycle     | 90.62             | 89.06 (1.56)            |
| VGG-16           | Constant      | 96.09             | 96.06 (0.05)            |
| VGG-16           | Multi-step    | 92.97             | 92.45 (0.90)            |
| VGG-16           | PLRS (ours)   | **97.66**         | 96.09 (1.56)            |
| WRN-28-10        | Cosine        | 92.03             | 91.90 (0.13)            |
| WRN-28-10        | Knee          | **92.04**         | 91.64 (0.63)            |
| WRN-28-10        | One-cycle     | 87.76             | 87.37 (0.35)            |
| WRN-28-10        | Constant      | **92.04**         | **92.00** (0.08)        |
| WRN-28-10        | Multi-step    | 88.94             | 88.80 (0.21)            |
| WRN-28-10        | PLRS (ours)   | 92.02             | 91.43 (0.54)            |


### Tiny ImageNet with ResNet-50

| **Scheduler** | **Max Test acc.** | **Mean test acc. (S.D)** |
|---------------|-------------------|--------------------------|
| Cosine        | 62.13             | **62.03 (0.15)**         |
| Knee          | 61.93             | 61.50 (0.42)             |
| One-cycle     | 52.24             | 51.99 (0.22)             |
| Constant      | 61.59             | 61.11 (0.42)             |
| Multi-step    | 61.28             | 61.20 (0.08)             |
| PLRS (ours)   | **62.34**         | 61.90 (0.73)             |
 
 

### Usage

* The code supports the CIFAR-10 and CIFAR-100 datasets. To switch to CIFAR-100, modify the dataset in the trainer.py file.

* To run Tiny ImageNet, use the trainer_tiny_imagenet.py file.

* Replace the lr_scheduler.py in "torch/optim/" with the lr_scheduler.py from this repository. You should be able to find your torch directory within your interpreter folder.

* Uncomment the models that you wish to run with modified checkpoints and uncomment the lr_scheduler that you wish to run the code with. The hyperparameters are set within the code; the user is not expected to change them to replicate the results in the paper.

* To run the code, give the name of the appropriate trainer file in run.sh and run the following:

```
chmod +x run.sh
./run.sh
```
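
For the step above that replaces lr_scheduler.py, the following is a minimal sketch of one way to locate the installed file from Python. The helper name `locate_package_file` is hypothetical and not part of this repository; the sketch assumes torch is installed in the interpreter you run it with, and it only prints the path without modifying anything.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional


def locate_package_file(package: str, relative_path: str) -> Optional[Path]:
    """Return the path of a file inside an installed package, or None.

    Hypothetical helper for illustration; it does not import the package.
    """
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        # Package is not installed (or has no file-based location).
        return None
    candidate = Path(spec.origin).parent / relative_path
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Look for torch/optim/lr_scheduler.py in the current interpreter.
    target = locate_package_file("torch", "optim/lr_scheduler.py")
    if target is None:
        print("torch/optim/lr_scheduler.py not found in this interpreter")
    else:
        print(f"File to replace: {target}")
```

Keeping a copy of the original file before overwriting it makes it easy to restore the stock PyTorch schedulers later.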

### To run online tensor decomposition:

* To run the online tensor decomposition code, run the file "SGDFrobeniusPlot.m" and use the initialization provided in "init.mat" to reproduce the exact result of the paper for PLRS.

