Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence

Fengxiang He, Tongliang Liu, Dacheng Tao

06 Sept 2019 (modified: 05 May 2023) · NeurIPS 2019
Abstract: Deep neural networks have achieved dramatic success, driven largely by the optimization method of stochastic gradient descent (SGD). However, it remains unclear how to tune its hyper-parameters, especially the batch size and the learning rate. This paper provides both theoretical and empirical evidence for a training strategy: keep the ratio of batch size to learning rate small in order to achieve good generalization. Specifically, we prove an $\mathcal O(1/\sqrt{N})$ generalization bound ($N$ is the training sample size) for neural networks trained by SGD, which is positively correlated with the ratio of batch size to learning rate. This correlation provides the theoretical foundation for the training strategy. Furthermore, we conduct a large-scale experiment to verify the correlation and the training strategy: we train over 1,600 models based on the ResNet-110 and VGG-19 architectures on the CIFAR-10 and CIFAR-100 datasets while strictly controlling unrelated variables. Test-set accuracies are collected for evaluation. Pearson correlation coefficients and the corresponding $p$-values on 164 groups of the collected data show that the correlation is statistically significant, which fully supports the training strategy.
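
The statistical test described in the abstract can be illustrated with a minimal sketch (this is not the authors' code; the run records and numbers below are hypothetical placeholders): for one group of trained models, correlate the batch-size-to-learning-rate ratio with the final test accuracy using scipy.stats.pearsonr.

    from scipy.stats import pearsonr

    # Hypothetical records: (batch size, learning rate, test accuracy) per trained model.
    runs = [
        (16,  0.1,  0.935), (32,  0.1,  0.931), (64,  0.1,  0.928),
        (128, 0.1,  0.921), (16,  0.05, 0.930), (128, 0.05, 0.912),
    ]

    ratios     = [b / lr for b, lr, _ in runs]   # batch size / learning rate
    accuracies = [acc for _, _, acc in runs]

    # The training strategy predicts a negative correlation between the ratio and
    # test accuracy (a larger ratio tends to hurt generalization); a small p-value
    # indicates the correlation is statistically significant.
    r, p_value = pearsonr(ratios, accuracies)
    print(f"Pearson r = {r:.3f}, p = {p_value:.3g}")

In the paper, this kind of correlation test is reported for each of the 164 groups of collected data rather than for a single toy group as above.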