Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all need for tuning, while automatically reducing learning rates over time on stationary problems and permitting them to grow appropriately on non-stationary tasks. Here, we extend the idea in three directions: we address proper minibatch parallelization, introduce reweighted updates for sparse or orthogonal gradients, and improve robustness on non-smooth loss functions, in the process replacing the diagonal Hessian estimation procedure, which may not always be available, by a robust finite-difference approximation. The final algorithm integrates all these components, has linear complexity, and is hyper-parameter free.
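For orientation, here is a minimal sketch of the kind of element-wise, batch-size-aware learning rate the abstract alludes to (the form the reviews below refer to as eq. 3). It is a reconstruction under simplifying assumptions, not the authors' reference implementation, and all names and constants are illustrative.

import numpy as np

def adaptive_rate(g_avg, v_avg, h_avg, n, eps=1e-12):
    """Element-wise, batch-size-aware learning rate (illustrative sketch):

        eta_i(n) = g_avg_i^2 / ( h_avg_i * ( g_avg_i^2 + sigma2_i / n ) )

    g_avg -- running estimate of the expected per-sample gradient, E[g]
    v_avg -- running estimate of the per-sample second moment, E[g^2]
    h_avg -- running estimate of the (positive) diagonal curvature
    n     -- minibatch size: averaging n samples divides the gradient
             variance by n, which permits a proportionally larger step
    """
    g2 = g_avg ** 2
    sigma2 = np.maximum(v_avg - g2, 0.0)        # per-sample gradient variance
    return g2 / (h_avg * (g2 + sigma2 / n) + eps)

# With little noise (or a large batch) the rate approaches the Newton-like 1/h;
# with dominant noise it shrinks toward zero.
eta = adaptive_rate(g_avg=np.array([0.5]), v_avg=np.array([0.5]),
                    h_avg=np.array([2.0]), n=10)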
Paper timeline:
16 Jan 2013 -- Tom Schaul submitted the paper to the ICLR 2013 Conference Track (fulfilling the call for conference papers) and requested endorsement for oral presentation.
05 Feb 2013 -- The document was revealed: Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients.
05 Feb 2013 -- Aaron Courville requested reviews from Anonymous 7b8e, Anonymous 0321, and Anonymous 7318 (due 01 Mar 2013); all three were completed.
27 Mar 2013 -- The ICLR 2013 Conference Track revealed its decision, fulfilling the endorsement request: endorsed for poster presentation.

4 Comments

Anonymous 0321 22 Feb 2013
This is a follow-up paper to reference [1], which describes a parameter-free adaptive method to set learning rates for SGD. This submission cannot be read without first reading [1]. It expands the work in several directions: the impact of minibatches, the impact of sparsity and gradient orthonormality, and the use of finite-difference techniques to approximate curvature. The proposed methods are justified with simple theoretical considerations under simplifying assumptions and with serious empirical studies. I believe that these results are useful. On the other hand, an opportunity has been lost to write a more substantial self-contained paper. As it stands, the submission reads like three incremental contributions stapled together.
Anonymous 7318 27 Feb 2013
Summary: The paper proposes a new variant of stochastic gradient descent that is fully automated (no hyper-parameters to tune) and is robust to various scenarios, including mini-batches, sparsity, and non-smooth gradients. It relies on an adaptive learning rate that takes into account a moving average of the Hessian. The result is a single algorithm that takes about 4x memory (with respect to the size of the model) and is easy to implement. The algorithm is tested on purely artificial tasks, as a proof of concept.

Review:
- The paper relies on a previous algorithm (bbprop) that is not described here and is only explained briefly on page 5, although it is first used on page 2. It would have been nice to provide more information about it earlier.
- The "parallelization trick" using mini-batches is good for a single-machine approach, where one can use multiple cores, but is thus limited by the number of cores. Also, how would this interfere with Hogwild-type updates, which also make efficient use of multiple cores for SGD?
- Obviously, results on real large datasets would have been welcome (I do think experiments on artificial datasets are very useful as well, but they may hide the fact that we have not fully understood the complexity of real datasets).
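To make the reviewer's "about 4x memory" remark concrete, here is one plausible accounting (a sketch of my own, not the authors' data layout): the parameter vector plus three per-parameter running averages. The actual implementation may keep additional per-parameter state, such as adaptive time constants.

import numpy as np

def allocate_vsgd_like_state(num_params, dtype=np.float32):
    """Illustrative per-parameter state: four arrays of length num_params,
    i.e. roughly four times the memory of the model itself. The authors'
    implementation may store further quantities (e.g. per-parameter time
    constants), so treat this only as a rough accounting."""
    return {
        "theta": np.zeros(num_params, dtype=dtype),  # model parameters
        "g_avg": np.zeros(num_params, dtype=dtype),  # moving average of gradients
        "v_avg": np.zeros(num_params, dtype=dtype),  # moving average of squared gradients
        "h_avg": np.zeros(num_params, dtype=dtype),  # moving average of diagonal curvature
    }

state = allocate_vsgd_like_state(10_000_000)
total_mb = sum(a.nbytes for a in state.values()) / 2**20   # ~152 MiB at float32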
Anonymous 7b8e 03 Mar 2013
This is a paper that builds on the adaptive learning rate scheme proposed in [1] for choosing learning rates when optimizing a neural network.

The first result (eq. 3) is an optimal learning rate schedule for a given mini-batch size n (a very realistic scenario, when one cannot adapt the size of the mini-batch during training because of computational and architectural constraints). The second interesting result is a way of setting the learning rates in those cases where one has sparse gradients (rectified linear units, etc.) -- this results in an effective rescaling of the rates by the number of non-zero elements in a given minibatch. The third nice result is the observation that in a sparse situation the gradient update directions are mostly orthogonal. Taking this intuition to its logical conclusion, the authors introduce a re-weighting scheme that essentially encourages the gradient updates to be orthogonal to each other (by weighting them proportionally to 1/number of times they interfere with each other). While the authors claim that this can be computationally expensive in general, for problems of realistic size (d in the tens of millions and n a few dozen examples) it can be quite interesting. The final interesting result is an adaptation of the curvature estimation to the fact that, with the advent of rectified linear units, we are often faced with optimizing non-smooth loss functions. The authors propose a method that is based on finite differences (with some robustness improvements) and is vaguely similar to what is done in SGD-QN.

Generally this is a very well-written paper that proposes a few sensible and relatively easy to implement ideas for adaptive learning rate schemes. I expect researchers in the field to find these ideas valuable. One disappointing aspect of the paper is the lack of real-world results on anything other than simulated (and known) loss functions.
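For illustration, here is one plausible reading of the interference-based re-weighting described above, sketched under my own assumptions rather than taken from the paper: each per-sample gradient is down-weighted by the number of other gradients in the minibatch it overlaps with, so nearly-orthogonal updates keep their full weight.

import numpy as np

def reweight_minibatch(sample_grads, tol=1e-12):
    """Hedged sketch of interference-based re-weighting.

    sample_grads -- array of shape (n, d): one (sparse-ish) gradient per example.
    Returns a reweighted minibatch gradient of shape (d,).
    """
    overlaps = sample_grads @ sample_grads.T                # (n, n) inner products
    interferes = (np.abs(overlaps) > tol).astype(float)     # who interferes with whom
    np.fill_diagonal(interferes, 0.0)                       # ignore self-overlap
    # 1 / (1 + #interferences): orthogonal gradients keep full weight,
    # heavily overlapping ones are damped (the +1 is just a safeguard).
    weights = 1.0 / (1.0 + interferes.sum(axis=1))
    return (weights[:, None] * sample_grads).sum(axis=0) / weights.sum()

The dense pairwise product here costs O(n^2 d); for genuinely sparse gradients the overlaps can be computed on the non-zero support only, which is presumably why the reviewer finds the scheme interesting at realistic sizes.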
Tom Schaul, Yann LeCun 05 Mar 2013
We thank the reviewers for their constructive comments. We'll try to clarify a few points they bring up:

Parallelization: The batchsize-aware adaptive learning rates (equation 3) are applicable independently of how the minibatches are computed, whether on a multi-core machine or across multiple machines. They are in fact complementary to the asynchronous updates of Hogwild, in that they remove its need for tuning the learning rate ("gamma") and the learning rate decay ("beta").

Bbprop: The original version of vSGD (presented in [1]) does indeed require the "bbprop" algorithm as one of its components to estimate element-wise curvature. One of the main points of this paper, however, is to replace it by a less brittle approach based on finite differences (section 5).

Large-scale experiments: We conducted a broad range of such experiments in the precursor paper [1], which demonstrated that the performance of the adaptive learning rates matches that of best-tuned SGD. Under the assumption that curvature does not change too fast, the original vSGD (using bbprop) and the one presented here (using finite differences) are equivalent, so those results remain valid -- but for more difficult (non-smooth) learning problems the new variant should be much more robust.

We'd also like to point out that an open-source implementation is now available at http://github.com/schaul/py-optim/blob/master/PyOptim/algorithms/vsgd.py
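For concreteness, here is a minimal sketch of a finite-difference diagonal curvature estimate of the kind the response refers to; section 5 of the paper adds robustness measures that this sketch omits, and grad_fn, eps, and floor are illustrative names and values of my own choosing.

import numpy as np

def fd_diag_curvature(grad_fn, theta, grad, eps=1e-4, floor=1e-12):
    """Secant-style estimate of the diagonal curvature from two gradient
    evaluations on the *same* minibatch; no analytic second derivatives
    (and hence no bbprop) are needed.

    grad_fn -- callable returning the minibatch gradient at given parameters
    theta   -- current parameter vector
    grad    -- gradient already computed at theta on this minibatch
    """
    delta = eps * (np.abs(theta) + eps)          # element-wise probing step
    grad_shifted = grad_fn(theta + delta)        # re-evaluate, same minibatch
    # Absolute values keep the estimate positive and usable on non-smooth
    # or non-convex losses; the floor avoids division by ~0 downstream.
    return np.maximum(np.abs(grad_shifted - grad) / delta, floor)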
ICLR 2013 Conference Track 27 Mar 2013
Endorsed for poster presentation: Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients