submission by Jimmy Ba • Distributed Second-Order Optimization using Kronecker-Factored Approximations
TL;DR: Fixed typos pointed out by AnonReviewer1 and AnonReviewer4, and added the experiments in Fig. 6 showing the poor scaling of batch-normalized SGD using a batch size of 2048 on GoogLeNet.
Abstract: As more computational resources become available, machine learning researchers train ever larger neural networks on millions of data points using stochastic gradient descent (SGD). Although SGD scales well in terms of both the size of the dataset and the number of parameters of the model, it has rapidly diminishing returns as parallel computing resources increase. Second-order optimization methods have an affinity for well-estimated gradients and large mini-batches, and can therefore benefit much more from parallel computation in principle. Unfortunately, they often employ severe approximations to the curvature matrix in order to scale to large models with millions of parameters, limiting their effectiveness in practice versus well-tuned SGD with momentum. The recently proposed K-FAC method (Martens and Grosse, 2015) uses a stronger and more sophisticated curvature approximation, and has been shown to make much more per-iteration progress than SGD, while only introducing a modest overhead. In this paper, we develop a version of K-FAC that distributes the computation of gradients and additional quantities required by K-FAC across multiple machines, thereby taking advantage of the method's superior scaling to large mini-batches and mitigating its additional overheads. We provide a TensorFlow implementation of our approach which is easy to use and can be applied to many existing codebases without modification. Additionally, we develop several algorithmic enhancements to K-FAC which can improve its computational performance for very large models. Finally, we show that our distributed K-FAC method speeds up training of various state-of-the-art ImageNet classification models by a factor of two compared to Batch Normalization (Ioffe and Szegedy, 2015).
Keywords: Deep learning, Optimization
Conflicts: cs.toronto.edu, google.com
Review - Distributed K-FAC
official review by AnonReviewer4 • Review - Distributed K-FAC
Review: In this paper, the authors present a partially asynchronous variant of the K-FAC method. The authors adapt/modify the K-FAC method in order to make it computationally tractable for optimizing deep neural networks. The method distributes the computation of the gradients and the other quantities required by the K-FAC method (2nd-order statistics and Fisher Block inversions). The gradients are computed in a synchronous manner by the ‘gradient workers’, and the quantities required by the K-FAC method are computed asynchronously by the ‘stats workers’ and ‘additional workers’. The method can be viewed as an augmented distributed Synchronous SGD method with additional computational nodes that update the approximate Fisher matrix and compute its inverse. The authors illustrate the performance of the method on the CIFAR-10 and ImageNet datasets using several models and compare with synchronous SGD.
The main contributions of the paper are:
1) Distributed variant of K-FAC that is efficient for optimizing deep neural networks. The authors mitigate the computational bottlenecks of the method (second order statistic computation and Fisher Block inverses) by asynchronous updating.
2) The authors propose a “doubly-factored” Kronecker approximation for layers whose inputs are too large to be handled by the standard Kronecker-factored approximation. They also present (Appendix A) a cheaper Kronecker factored approximation for convolutional layers.
3) Empirically illustrate the performance of the method, and show:
- Asynchronous Fisher Block inversions do not adversely affect the performance of the method (CIFAR-10)
- K-FAC is faster than Synchronous SGD (with and without BN, and with momentum) (ImageNet)
- Doubly-factored K-FAC method does not deteriorate the performance of the method (ImageNet and ResNet)
- Favorable scaling properties of K-FAC with mini-batch size
Pros:
- Paper presents interesting ideas on how to make computationally demanding aspects of K-FAC tractable.
- Experiments are well thought out and highlight the key advantages of the method over Synchronous SGD (with and without BN).
Cons:
- “…it should be possible to scale our implementation to a larger distributed system with hundreds of workers.” The authors mention that this should be possible, but fail to mention the potential issues with respect to communication, load balancing and node (worker) failure. That being said, as a proof-of-concept, the method seems to perform well and this is a good starting point.
- Mini-batch size scaling experiments: the authors do not provide validation curves, which would be interesting for such an experiment. Keskar et al. (2016) (On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima) provide empirical evidence that large-batch methods do not generalize as well as small-batch methods. As a result, even if the method has favorable scaling properties (in terms of mini-batch size), it may not be effective in practice.
The paper is clearly written and easy to read, and the authors do a good job of communicating the motivation and main ideas of the method. There are a few minor typos and grammatical errors.
Typos:
- “updates that accounts for” — “updates that account for”
- “Kronecker product of their inverse” — “Kronecker product of their inverses”
- “where P is distribution over” — “where P is the distribution over”
- “back-propagated loss derivativesas” — “back-propagated loss derivatives as”
- “inverse of the Fisher” — “inverse of the Fisher Information matrix”
- “which amounts of several matrix” — “which amounts to several matrix”
- “The diagram illustrate the distributed” — “The diagram illustrates the distributed”
- “Gradient workers computes” — “Gradient workers compute”
- “Stat workers computes” — “Stat workers compute”
- “occasionally and uses stale values” — “occasionally and using stale values”
- “The factors of rank-1 approximations” — “The factors of the rank-1 approximations”
- “be the first singular value and its left and right singular vectors” — “be the first singular value and the left and right singular vectors … , respectively.”
- “\Psi is captures” — “\Psi captures”
- “multiplying the inverses of the each smaller matrices” — “multiplying the inverses of each of the smaller matrices”
- “which is a nested applications of the reshape” — “which is a nested application of the reshape”
- “provides a computational feasible alternative” — “provides a computationally feasible alternative”
- “according the geometric mean” — “according to the geometric mean”
- “analogous to shrink” — “analogous to shrinking”
- “applied to existing model-specification code” — “applied to the existing model-specification code”
- “: that the alternative parametrization” — “: the alternative parameterization”
Minor Issues:
- In paragraph 2 (Introduction) the authors mention several methods that approximate the curvature matrix. However, several methods that have been developed are not mentioned. For example:
1) (AdaGrad) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
2) Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization (https://arxiv.org/abs/1607.01231)
3) adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs (http://link.springer.com/chapter/10.1007/978-3-319-46128-1_1)
4) A Self-Correcting Variable-Metric Algorithm for Stochastic Optimization (http://jmlr.org/proceedings/papers/v48/curtis16.html)
5) L-SR1: A Second Order Optimization Method for Deep Learning (/pdf?id=By1snw5gl)
- Page 2, equation s = WA, is there a dimension issue in this expression?
- The x-axis labels for the top plots in Figures 3, 4, 5, and 7 (Updates x XXX) appear to be headings for the lower plots.
- “James Martens. Deep Learning via Hessian-Free Optimization” appears twice in References section.
Rating: 7: Good paper, accept
Confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct
The additional communication cost of Distributed K-FAC is modest; the new revision addresses all the typos and references
public comment by Jimmy Ba • The additional communication cost of Distributed K-FAC is modest; the new revision addresses all the typos and references
Comment: We thank the reviewer for the valuable comments and detailed suggestions on how to improve the paper.
We used academic-scale resources for our experiments (see our comment below), but it would indeed be interesting to see how our method scales with much greater computational resources.
In the minibatch size experiment, we focused on training curves because the focus of our paper was on optimization rather than generalization; however, we qualitatively observed that the validation curves display a very similar trend to the training curves for most of the optimization. Models trained with Distributed K-FAC do overfit more than the BN baselines at the end, but this is likely due to the extra noise that BN adds to the updates (similar to dropout), which is an unintended side-effect of that algorithm. One could consider adding other types of noise-based regularizers (e.g. some flavor of dropout) to make up for this difference, if it were important.
The reviewer also asked about the potential communication bottleneck that could arise from scaling the distributed K-FAC algorithm. Scaling up the algorithm would primarily involve adding more gradient workers. Our scheme for computing gradients over large mini-batches is simply the same as the scheme used in the standard synchronous SGD framework, and other researchers have already studied how to scale this up to hundreds of workers. We believe any additional communication costs specific to K-FAC will be modest.
In general, the second-order statistics, i.e. the Kronecker factors, are computed asynchronously by additional stats workers and are independent of the gradient workers. The main communication bottleneck of distributed K-FAC is communicating the Kronecker factors and updating all of the factors within a Fisher block at the same time on the parameter server. In a large CNN, the communication cost of transferring the Kronecker factors amounts to O(K^4 C^2), where K is the kernel width and C is the number of channels, compared to O(K^2 C^2) for transferring the gradients and parameters. We think this bottleneck, which is about K^2 times more costly than transferring the gradient, can be amortized by transferring and refreshing the Kronecker factors only occasionally (which was shown to work well in our experiments). For AlexNet, the Kronecker factors are refreshed once every 200 parameter updates, which still gives a substantial 2x speed-up compared to our “improved Batch Norm” baseline.
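To make the K^2 factor concrete, here is a rough, purely illustrative back-of-the-envelope calculation in Python (the helper function and the layer sizes are ours, not taken from the paper or its code), comparing the number of values that must be transferred for a convolutional layer's gradient versus its two Kronecker factors:

```python
# Illustrative only: per-layer communication volume (number of floats) for a
# conv layer, comparing the gradient against the two Kronecker factors.
def conv_comm_volume(kernel_width, in_channels, out_channels):
    # Gradient has one entry per weight: K^2 * C_in * C_out  ~ O(K^2 C^2).
    grad = kernel_width ** 2 * in_channels * out_channels
    # Input-side Kronecker factor is (K^2 C_in) x (K^2 C_in)  ~ O(K^4 C^2);
    # output-side factor is C_out x C_out.
    factors = (kernel_width ** 2 * in_channels) ** 2 + out_channels ** 2
    return grad, factors

grad, factors = conv_comm_volume(kernel_width=3, in_channels=256, out_channels=256)
print(factors / grad)  # roughly K^2 = 9x more data, amortized by infrequent refreshes
```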
We have created a new revision which addresses the typos found by the reviewer, adds the references they requested, and addresses other issues such as figure spacing and the problem with the “s = WA” equation (this revealed a more far-reaching problem with our notation which we have now rectified - thanks for catching this!). We are glad the reviewer appreciates the main ideas behind our paper and our experiments. In the next revision of the paper, we plan to incorporate the above discussion about communication costs, etc.
Official Review
official review by AnonReviewer1 • Official Review
Review: The paper proposes an asynchronous distributed K-FAC method for efficient optimization of deep networks. The authors introduce interesting ideas showing that many computationally demanding parts of the original K-FAC algorithm can be efficiently implemented in a distributed fashion. The gradients and the second-order statistics are computed by distributed workers separately and aggregated at the parameter server, along with the inversion of the approximate Fisher matrix computed by a separate CPU machine. The experiments are performed on the CIFAR-10 and ImageNet classification problems using models such as AlexNet, ResNet, and GoogLeNet.
The paper includes many interesting ideas and techniques for deriving an asynchronous distributed version of the original K-FAC, and the experiments also show good results on a few interesting cases. However, I think the empirical results are not yet thorough and convincing enough. In particular, experiments with a larger and more varied number of GPU workers (on the same machine, or across multiple machines) are desired. For example, as pointed out by the authors in their answer to a comment, Chen et al. (Revisiting Distributed Synchronous SGD, 2015) used 100 workers to test their distributed deep learning algorithm. Even considering that the authors have limited computing resources in an academic research setting, a maximum of 4 or 8 GPUs seems too limited as the only test case for demonstrating the efficiency of a distributed learning algorithm.
Rating: 6: Marginally above acceptance threshold
Confidence: 3: The reviewer is fairly confident that the evaluation is correct
We used academic-scale computing resources, but we still addressed the key issues
public comment by Roger Baker Grosse • We used academic-scale computing resources, but we still addressed the key issues
Comment: Thank you for your review. The reason we were limited to 8 GPUs (rather than 100, like the Google Brain paper you reference) is that our experiments were run on academic-scale computing resources.
We note, however, that our 8-GPU experimental setup is comparable to the training environment used in recent publications on state-of-the-art ImageNet models (He et al., 2015). We feel that 8 GPUs was a sufficient scale to address the key research questions surrounding the large-scale second-order optimization presented in our paper. We obtained substantial speedups on the four most widely used object recognition networks relative to a widely used and well-engineered baseline. All the evidence so far indicates that more highly parallel (i.e. larger mini-batch) settings are strictly more favorable to K-FAC relative to SGD, so the fact that the algorithm performs well even at this modest scale is a strong signal.
Furthermore, our method is based on the synchronous SGD framework, which has already been applied at a massive scale, so there’s no reason to believe we would suddenly encounter major scalability issues. In fact, because the inverse Fisher blocks are refreshed asynchronously, which can be implemented as a background thread on the parameter server, we can expect distributed K-FAC to have effectively the same communication pattern and cost as synchronous SGD. Note that improving the efficiency of massive-scale synchronous SGD is beyond the scope of our paper.
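To illustrate what refreshing the inverses as a background thread can look like in the simplest case, here is a minimal single-process sketch in plain Python/NumPy (the class, the toy statistics, and the update rule are our own illustrative stand-ins, not the implementation described in the paper): the training loop always uses whatever inverse is currently available, while a background thread occasionally recomputes it from the accumulated factor.

```python
import threading
import time
import numpy as np

class ToyFisherBlock:
    """Illustrative stand-in for one layer's Kronecker-factored Fisher block."""
    def __init__(self, dim):
        self.factor = np.eye(dim)      # running second-order statistics
        self.inverse = np.eye(dim)     # possibly stale inverse used by the updates
        self.lock = threading.Lock()

    def update_stats(self, acts):
        # Exponential moving average of the (toy) second-order statistics.
        with self.lock:
            self.factor = 0.95 * self.factor + 0.05 * (acts.T @ acts) / len(acts)

    def refresh_inverse(self, damping=1e-2):
        # Runs in the background; the expensive inverse never blocks the updates.
        with self.lock:
            factor = self.factor.copy()
        inv = np.linalg.inv(factor + damping * np.eye(factor.shape[0]))
        with self.lock:
            self.inverse = inv

block = ToyFisherBlock(dim=8)
stop = threading.Event()

def inverse_worker():
    while not stop.is_set():
        block.refresh_inverse()
        time.sleep(0.01)               # refresh only occasionally

threading.Thread(target=inverse_worker, daemon=True).start()

w = np.zeros(8)
for step in range(100):
    acts = np.random.randn(32, 8)
    grad = acts.mean(axis=0)           # placeholder gradient for the sketch
    block.update_stats(acts)
    w -= 0.1 * (block.inverse @ grad)  # uses whatever inverse is current, possibly stale
stop.set()
```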
There is already some concern that machine learning conferences are becoming a party for the rich, as evidenced by the question that rose to #2 in the Deep Learning Symposium poll: https://www.facebook.com/events/636544206518160/permalink/636550569850857/
In our experiments, we ran a variety of architectures on ImageNet, and for each one, carefully tuned the baselines using a grid search over hyperparameters. This already makes it a very expensive set of experiments by academic standards. As a community, we need to be careful to avoid winding up in a system where you need the resources of a major company lab in order to get papers accepted to ICLR. If you have concerns about the scalability of particular aspects of the algorithm, perhaps you can suggest a way to address them while staying within a reasonable budget?
It will indeed be interesting to see how the algorithm scales up to very large-scale parallelism, which is something we intend to pursue now that the method has been validated at academic scale. (This is a good example of why a system of open scientific publication is useful to the tech industry -- without the method having been validated at academic scale, there probably would be nothing to justify trying it at industrial scale.)
official review by AnonReviewer2 (review text not shown)
Response to review from AnonReviewer2
public comment by James Martens • Response to review from AnonReviewer2
Comment: Thank you for your review.
In response to your concerns I can tell you we are currently working on an industrial-scale implementation of distributed K-FAC within TensorFlow. This should hopefully be available within a month or two, and will be significantly more powerful and easier to use than the prototype implementation analyzed in our experiments.
We remain hopeful that distributed K-FAC will scale well to a truly distributed setting with multiple networked machines. This seems likely since the number of iterations used by the method is much lower than the number of iterations used by Sync SGD, and gradients/parameters only need to be communicated once per iteration.
We have corrected the typos and other minor errors noted.
Questions
pre-review question by AnonReviewer1 • Questions
Question: Could you elaborate a bit more on how the SGD is implemented on 4 GPUs? And, why not compare to other distributed optimization methods such as Downpour SGD and Elastic Averaging SGD?
Aren't 4 GPUs too small a scale to evaluate a distributed algorithm?
We use the most common synchronous SGD optimizer as our baseline
public comment by Jimmy Ba • We use the most common synchronous SGD optimizer as our baseline
Comment: Thanks for the comments. Our baseline is a synchronous SGD optimizer that uses the gradient averaged over the gradient workers. The parameter server adjusts the parameters once it has received the gradients from every worker, and a blocking mechanism is used for synchronization among the gradient workers. This is the standard optimizer used by ResNet (He et al., 2015). More recently, Chen et al. (Revisiting Distributed Synchronous SGD, 2015) have shown that synchronous SGD outperforms Downpour-style asynchronous SGD. We think using synchronous SGD in our experiments is a fair and strong baseline. Our distributed K-FAC algorithm can be viewed as a straightforward add-on module that complements any distributed SGD system and should provide a substantial improvement over the original SGD system, but such an evaluation is beyond the scope of this paper. Moreover, distributed K-FAC, like any second-order method, benefits from low-variance gradients and therefore has a good affinity for the synchronous SGD method, where averaging over the gradient workers provides a straightforward low-noise gradient estimate. Because we are using the synchronous SGD method, using only 4 GPUs is still a good simulation of a truly distributed setup.
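For concreteness, here is a minimal single-process sketch of the synchronous data-parallel update described above, using a toy least-squares objective of our own (it illustrates the averaging-and-blocking pattern, not the actual baseline code): each simulated gradient worker computes a gradient on its shard of the mini-batch, and the parameter update is applied only after all shards have been received and averaged.

```python
import numpy as np

def worker_grad(w, x_shard, y_shard):
    # Gradient of a least-squares loss on one worker's shard of the mini-batch.
    return x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
w_true = rng.standard_normal(10)
w = np.zeros(10)
num_workers, lr = 4, 0.1

for step in range(200):
    x = rng.standard_normal((256, 10))
    y = x @ w_true
    x_shards = np.array_split(x, num_workers)
    y_shards = np.array_split(y, num_workers)
    # "Blocking" step: collect a gradient from every worker before updating.
    grads = [worker_grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    w -= lr * np.mean(grads, axis=0)  # parameter server applies the averaged gradient
```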
Paper is updated
public comment by James Martens • Paper is updated
Comment: We revised the paper, improving the writing quality and clarity of the Experiments section in particular.
pre-review question (question text not shown)
Fixed figure arrangement in the revised paper. Training error is typically higher than validation error during the majority of the ImageNet training time.
public comment by Jimmy Ba • Fixed figure arrangement in the revised paper. Training error is typically higher than validation error during the majority of the ImageNet training time.
Comment: Thanks for pointing out the figure arrangement issue. We have fixed that in the latest version. Figure 3 has also been updated so that the legend does not cover the line plots.
It is generally true that validation error is higher than training error. However, even though the ImageNet training set contains 1.2 million images, a data pre-processing pipeline that includes image jittering and aspect distortion is almost always used for ImageNet training. The pre-processing creates an augmented training set that is more difficult than the undistorted validation set. Therefore, the validation error is often lower than the training error during the first 90% of the training time. This observation is consistent with previously published results (He et al., 2015, "Deep Residual Learning for Image Recognition").
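As a small, purely illustrative sketch of why the augmented training inputs are harder than the validation inputs (this is not the actual input pipeline used in our experiments; the functions and sizes are made up): training images receive a random crop and random horizontal flip, while validation images receive only a deterministic, undistorted center crop.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_view(img, out=224):
    # Training-time augmentation: random crop plus random horizontal flip,
    # a simplified stand-in for jittering/aspect distortion.
    h, w, _ = img.shape
    top = rng.integers(0, h - out + 1)
    left = rng.integers(0, w - out + 1)
    crop = img[top:top + out, left:left + out]
    return crop[:, ::-1] if rng.random() < 0.5 else crop

def val_view(img, out=224):
    # Validation: deterministic center crop, no distortion.
    h, w, _ = img.shape
    top, left = (h - out) // 2, (w - out) // 2
    return img[top:top + out, left:left + out]

img = rng.random((256, 256, 3))
print(train_view(img).shape, val_view(img).shape)  # (224, 224, 3) (224, 224, 3)
```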
Thanks again for the comment.
ICLR committee final decision
acceptance by pcs • ICLR committee final decision