Keywords: deep learning, learning rate schedules, information theory, activation pattern temperature
Abstract: We study learning rate (LR) scheduling in neural networks, which often significantly affects achievable training time and generalization performance. Although schedules such as 1-cycle offer substantial gains over baseline methods, the effect of LR curves on the training process is not well understood. To gain more insight into the training process, we combine information-theoretic ideas with probabilistic optimization, namely simulated annealing. Specifically, we introduce the activation pattern temperature, which (i) captures changes in the non-linear behavior of ReLU networks and (ii) is free of hyperparameters and thus more interpretable. Examining the training process, we find that 1-cycle simply yields a linear decrease in temperature, reminiscent of successful cooling strategies in simulated annealing. To test for a causal connection, we devise ActCooLR, an automatic LR scheduler that produces declining temperature profiles. In experiments with various CNN architectures and different image classification data sets, we obtain results that compare favorably with or exceed the performance of hand-tuned schedules.
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
TL;DR: We measure the probability of activation pattern changes in neural networks (the activation temperature) during the training process and derive a new learning rate schedule based on cooling profiles, reminiscent of simulated annealing.
Supplementary Material: pdf
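As a rough illustration of the quantity described in the TL;DR, the sketch below estimates how often ReLU units switch between active and inactive across one update step. It is not the paper's definition of activation pattern temperature, which this listing does not give. The function names `activation_pattern` and `pattern_change_rate` and the toy pre-activation values are assumptions made for this example only.

```python
import numpy as np


def activation_pattern(pre_activations):
    """Binary on/off pattern of ReLU units: True where the unit is active."""
    return np.asarray(pre_activations) > 0


def pattern_change_rate(pre_before, pre_after):
    """Fraction of (sample, unit) pairs whose ReLU state flipped.

    Hypothetical proxy for the probability of pattern changes; the
    paper's exact temperature definition may differ.
    """
    before = activation_pattern(pre_before)
    after = activation_pattern(pre_after)
    if before.shape != after.shape:
        raise ValueError("pre-activation arrays must have the same shape")
    return float(np.mean(before != after))


# Toy data (made up): 2 samples, 3 hidden units, before and after one update.
pre_before = np.array([[0.5, -1.0, 2.0], [-0.2, 0.3, -0.7]])
pre_after = np.array([[0.4, 0.1, 2.1], [-0.3, 0.2, 0.6]])

rate = pattern_change_rate(pre_before, pre_after)
print(f"fraction of flipped units: {rate:.3f}")
```

In this toy case two of the six sample-unit pairs flip, so the estimate is one third. Tracking such an estimate over training steps would give a curve comparable in spirit to the temperature profiles the abstract describes.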