Behavior of Mini-Batch Optimization for Training Deep Neural Networks on Large Datasets
01 Dec 2021 | Tags: stochastic gradient descent, large datasets, convex optimization, parallelized computation, deep neural networks

Introduction
The recent rapid proliferation of neural networks as the approach of choice in many ML applications has spurred a wealth of research into the best optimization techniques for training them. By far the most common approach is a variant of gradient descent, which can be described simply as follows:
Gradient descent is a first-order optimization method that updates model parameters by taking iterative steps in the direction opposite to the gradient of the objective function at each point. In its standard form, it uses all the data points in the training set for every update, which becomes computationally expensive as the training set grows. Stochastic gradient descent methods were therefore developed, in which a single example at a time provides an approximation of the gradient of the objective function, reducing the per-iteration computational cost.
Vanilla mini-batch stochastic gradient descent (SGD), and less frequently full gradient descent, are the approaches most often associated with the training of deep neural networks (DNNs) [1]. The reason for this lies in the typically non-convex nature of the loss landscapes associated with DNNs, which requires an iterative approach to finding optimal values for the weights [2]. Beyond necessitating an iterative approach, this non-convexity introduces additional challenges in finding model weights that suitably minimize the loss. Because of these challenges, the development of improved optimization schedules is a very active research area in machine learning.
The study of these optimization schedules is particularly significant in the context of large datasets, as batch size can be treated as a tunable hyper-parameter in the implementation of stochastic mini-batch gradient descent. Tuning the batch size involves balancing trained model performance against the practical upper limit on training costs [3][4][5], which we discuss in more detail later on.
Regular gradient descent requires more computation per iteration than SGD, since the full training set is used to compute the gradient. An intermediate between these two extremes is mini-batch SGD, which uses a subset of the training data for each update. This lets us operate in a middle ground between the expense of computing the full gradient and the large number of noisy, sequential updates required when the gradient is computed one training sample at a time.
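For concreteness, the update rules for the three variants can be written as follows, where $w_t$ denotes the model parameters at iteration $t$, $\eta$ the learning rate, $\ell_i$ the loss on training example $i$, $N$ the training-set size, and $B_t$ the mini-batch drawn at iteration $t$ (the notation is ours, introduced only for this comparison):

$$w_{t+1} = w_t - \frac{\eta}{N}\sum_{i=1}^{N} \nabla \ell_i(w_t) \qquad \text{(GD)}$$

$$w_{t+1} = w_t - \eta\,\nabla \ell_{i_t}(w_t), \quad i_t \text{ drawn uniformly at random} \qquad \text{(SGD)}$$

$$w_{t+1} = w_t - \frac{\eta}{|B_t|}\sum_{i \in B_t} \nabla \ell_i(w_t) \qquad \text{(mini-batch SGD)}$$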
Gradient Descent (GD) | Stochastic Gradient Descent (SGD) | Mini-batch Gradient Descent (Mb-SGD) |
---|---|---|
First-order derivative of the objective function calculated using all data points in the training set. | First-order derivative of the objective function approximated by one data point in the training set. | First-order derivative of the objective function calculated using a subset of data points in the training set. |
Pros: Fewer iterations required for convergence. Less communication between workers. Highly parallelizable. | Pros: Cheaper computation at each iteration than GD. Faster overall convergence in terms of FLOPs/wall-clock time. | Pros: More parallelizable than SGD. Less computation per iteration than GD. Computation vs. communication can be tuned through the batch size. |
Cons: Each iteration is expensive to compute. Less gradient diversity. May get stuck in local minima. | Cons: Noisy steps toward the minima. More iterations required for convergence and hence more communication between workers. Less parallelizable. | Cons: Slower convergence in terms of iterations than GD. Batch size is an additional hyperparameter to tune. |
As alluded to earlier, one way the computational expense can be managed is through the use of mini-batches [5]. In a parallelized setting, the operations involved in gradient descent can be broken down into two steps: a map step, in which each worker independently computes gradients, and a reduce step, in which the gradient estimates from all workers are averaged. Communication between the worker nodes is required between the map and reduce steps so that the gradient computed by each worker can be transferred for the final average. The computation cost of the map step increases with mini-batch size, as a larger number of gradients must be calculated. However, communication decreases as mini-batch size increases, since fewer iterations are required for convergence [6]. When a large cluster is at our disposal, we may be tempted to increase the batch size to the threshold of what is computationally tractable for our problem, so that the number of iterations, and the communication between workers, required for each training epoch is minimized and training converges more quickly. Unfortunately, there are pitfalls associated with increasing batch size, namely the loss of generalization [4].
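To make the map/reduce split concrete, here is a minimal sketch of data-parallel gradient averaging on a toy least-squares problem. The setup is purely illustrative (NumPy, four simulated "workers", made-up function names such as `local_gradient`), not code from any of the cited works.

```python
import numpy as np

def local_gradient(w, X_shard, y_shard):
    """Map step: one worker's gradient of a least-squares loss on its data shard."""
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

def minibatch_update(w, shards, lr=0.1):
    """Reduce step: average the per-worker gradients, then take one descent step."""
    grads = [local_gradient(w, X, y) for X, y in shards]  # map (parallelizable)
    g = np.mean(grads, axis=0)                            # reduce (requires communication)
    return w - lr * g

# Toy usage: 4 simulated workers, each holding a shard of the mini-batch.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(128, 5)), rng.normal(size=128)
shards = list(zip(np.split(X, 4), np.split(y, 4)))
w = np.zeros(5)
for _ in range(100):
    w = minibatch_update(w, shards)
```

Increasing the shard (and hence mini-batch) size raises the cost of each map step but reduces the number of reduce steps, i.e., the number of communication rounds, needed to converge.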
With increasing batch size, gradient diversity decreases. Gradient diversity is the degree to which the gradients within a mini-batch differ from one another. If gradient diversity is low, there may be diminishing returns in using a large batch size relative to a smaller one, since the larger mini-batch is not providing much new information. While some researchers have postulated that there is a critical value up to which batch size can be increased without a loss in generalizability [7], we would like to push this threshold further. To motivate the development of stochastic weight averaging in parallel, we first introduce the fundamental theory at the core of the optimization problem that mini-batch SGD seeks to solve. Then we present leading theories for the loss of generalizability associated with large batch sizes.
To fully understand these drawbacks, we must first develop a conceptual understanding of non-convexity and how it shapes the loss landscape on which we seek a minimum. In the simplest sense, a non-convex function is one that is wavy in appearance: it has some 'valleys' (local minima) that are not as deep as the overall deepest 'valley' (the global minimum). When we optimize a non-convex function, we are never guaranteed to converge to the global minimum; the minimum we find depends on the weight initialization (the starting point) and the tunable hyper-parameters we select for training. Different local minima can also be qualitatively different, i.e., sharper or flatter. A flat minimizer is a minimum where the value of the function changes little over a large neighborhood of parameters. In contrast, a sharp minimizer is a minimum where the function changes rapidly within a small neighborhood of parameters. A simple two-dimensional example of a non-convex function is shown in the figure below.
The cause of the observed relationship between batch size and model generalizability is still hotly debated by the research community. While numerous theories exist, it is widely agreed that SGD with a smaller batch (SB) size is more successful at locating the local minima that generalize better [4]. This difference may be caused by one of the following four phenomena [4]:
- Large Batch (LB) methods tend to overfit the model
- LB methods are more likely to get stuck in saddle points
- Due to the decrease in stochasticity in LB methods, the optimizer does not 'explore' the entire range of the objective; it selects the minimum closest to the initial point, i.e., it has less gradient diversity
- SB and LB methods converge to different minima with different qualities and generalization properties
In a similar vein, popular related studies have suggested that the depth and width of the valleys in the loss function play a very important role in the observed generalizability of DNNs [4]. The simple explanation for why this might be lies in the fact that the loss landscape in evaluation, and eventually in inference after deployment, is never completely identical to the landscape traversed in training [4]. The intuition is that if we end up in a sharp minimum during training, it is conceivable that this sharp minimum is shifted slightly in the testing loss landscape. In that scenario, even though the training accuracy is quite high, the model will likely generalize poorly to instances that do not appear in the training dataset. We extend the simple example shown above to illustrate this phenomenon.
The toy example presented here clearly illustrates the theorized problem. In the narrow valley, the optimal point, which corresponds to a low loss in the training landscape, corresponds to a much higher loss in the testing/evaluation landscape. Because of this, we prefer wider valleys, or flat minima, in the loss landscape: the shift causes a much smaller change in the realized loss, and the model therefore generalizes better.
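As a back-of-the-envelope illustration of this point (our own toy numbers, not taken from any of the cited papers), consider two one-dimensional quadratic 'valleys' of different curvature and shift them slightly to mimic the train/test mismatch:

```python
import numpy as np

def loss(w, center, curvature):
    """A one-dimensional quadratic 'valley' with a given curvature."""
    return curvature * (w - center) ** 2

w_star = 0.0   # minimizer found on the training loss
shift = 0.3    # hypothetical train -> test shift of the valley

for name, curvature in [("flat valley", 1.0), ("sharp valley", 50.0)]:
    train_loss = loss(w_star, center=0.0, curvature=curvature)
    test_loss = loss(w_star, center=shift, curvature=curvature)
    print(f"{name}: train loss {train_loss:.2f}, test loss {test_loss:.2f}")

# The same 0.3 shift costs 0.09 in the flat valley but 4.50 in the sharp one,
# which is the intuition behind preferring flat minima.
```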
While the mechanisms by which batch size affects this behavior are still not fully understood, under the shifted-loss-landscape theory, vanilla mini-batch SGD can be vulnerable to finding these 'sharp' loss minima when a large mini-batch size is used [4]. Although smaller mini-batch sizes tend to find solutions with better generalizability, which therefore lie in wider valleys, using them is less tractable when working with very large datasets. This is a result of the substantially increased number of training iterations and worker communications needed to complete training, which leads to a substantially increased training time. A popular theory for why large mini-batch SGD is more prone to finding sharper minima is the reduced stochasticity, or noise, in larger batches [8]. Small batches, which introduce more noise, may be better at avoiding sharp minima: the noise slightly shifts the location of the iterates, and in a sharp minimum such a shift produces a large increase in loss, effectively pushing the optimizer back out.
Approaches developed to mitigate the problem
In an effort to mitigate the above described phenomena, and balance the interplay of batch size with generalizability, a range of approaches have been developed. The best approach depends heavily on the nature of the problem addressed, as the amount of local computation and communication required varies substantially among the methods developed to obtain a trained model which generalizes well.
The communication and computation trade-off for various techniques of distributed optimization is depicted below [9].
One-shot averaging
This is an extreme of distributed computing in which the amount of local computation far exceeds the amount of communication. The data is partitioned across the worker machines, and each worker finds its own locally optimal model. The overall model is then obtained by averaging the local optima [6]. However, it has been noted that this method may fail to reach the optimal solution for both convex and non-convex objectives [10].
Ensemble methods
Obtaining good generalizability with LB-SGD has been the focus of many research studies. The studies below have leveraged ensemble methods to reduce communication between workers while obtaining desirable test accuracies.
Snapshot ensembling leverages the stochastic nature of SGD. At the start of the training process, SGD makes large leaps across the loss landscape as a result of the large gradients it encounters. Once a local minimum is approached, the learning rate is dynamically modulated using cosine annealing. This process, in effect, simulates a restart of training.
Cosine annealing is an implementation of a cyclic learning rate where the learning rate is reset to a higher value at intervals.
The learning rate is reduced within each cycle, which allows stable convergence to a local minimum. Once this is achieved, the resulting model is saved as a "snapshot" and added to an ensemble. The learning rate is then increased again so that the current minimum is overshot and the model can converge to another local solution. The cosine-annealing cycles are sufficiently low in frequency that considerable swaths of the domain are explored, yielding a diverse set of solutions in parameter space by converging to different local minima. This approach results in a set of models, each effectively trained from scratch. The downside is that, for particularly large architectures, training can be somewhat slow even when doing a "warm restart". Furthermore, deploying an ensemble of several models may be infeasible for many applications, as querying multiple models may not be fast enough at inference time and the ensemble may not be compact enough to run on small, less powerful machines. More details on this method are provided by the authors of [11].
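A minimal sketch of the cyclic cosine-annealing schedule described above is shown below; the function name and constants are illustrative, not the exact values used in the snapshot ensembling paper.

```python
import math

def snapshot_lr(iteration, total_iters, num_cycles, lr_max):
    """Cyclic cosine annealing: the learning rate decays from lr_max to ~0
    within each cycle and is reset to lr_max at the start of the next cycle."""
    cycle_len = math.ceil(total_iters / num_cycles)
    t = iteration % cycle_len                       # position within the current cycle
    return 0.5 * lr_max * (math.cos(math.pi * t / cycle_len) + 1.0)

# A snapshot of the model would be saved at the end of each cycle,
# i.e., whenever (iteration + 1) % cycle_len == 0.
lrs = [snapshot_lr(t, total_iters=6000, num_cycles=6, lr_max=0.1) for t in range(6000)]
```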
Fast Geometric Ensembling (FGE) is very similar to snapshot ensembling, with the notable change being a piecewise-cyclical learning rate in place of cosine annealing, with a much higher cycle frequency. The increased cycle frequency has the effect of speeding up training. FGE leverages the existence of paths of low loss between local minima, travelling along them in small steps and collecting sufficiently different parameter snapshots along the way [12]. While this approach trains faster than the snapshot ensembling approach discussed above, it still suffers from the limitation of not being as fast at inference as a single instance of the model. Furthermore, as a result of having multiple models, it may also not be compact enough to run on small, less powerful machines.
The figure below depicts an example of a low-error path between different minima in a two-parameter space. Here, the color bar denotes the error magnitude, with red denoting low error [13].
Local updating methods
Local updating methods can have complete independence in choosing the trade-off between computation and communication while keeping model performance constant. These methods have the ability to perform more computation than full batch gradient descent before communicating.
CoCoA is a local updating method in which each subproblem is solved to an accuracy Θ, which provides the flexibility in trading off local computation against communication. The subproblems are posed within a dual framework: instead of minimizing the primal objective over the training examples, each worker machine maximizes its dual subproblem up to accuracy Θ. The dual is maximized using gradient ascent (equivalently, gradient descent on its negative) on each worker machine, and the regularization function is a carefully chosen quadratic. Smith et al. prove that the primal objective is lower bounded by the dual objective for convex functions [14]. The easy separability of the dual across machines gives this method complete flexibility in resource allocation, with no loss of generalizability for convex objective functions. However, the method is not well understood for non-convex objectives (e.g., deep learning), which remains an active area of research.
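To show the general shape of a local updating method, here is a deliberately simplified caricature. It is not CoCoA's actual dual update; it replaces the dual maximization with a few primal gradient steps on a toy least-squares shard, purely to show where the Θ knob (number of local steps) and the single communication step sit.

```python
import numpy as np

def local_update_round(w_global, shards, local_steps, lr=0.1):
    """One communication round of a generic local-updating method: each worker
    performs several local steps on its own shard (more steps = a more accurate
    local solve, i.e., a smaller Theta), and only the resulting model deltas
    are communicated and averaged."""
    deltas = []
    for X, y in shards:
        w_local = w_global.copy()
        for _ in range(local_steps):
            grad = X.T @ (X @ w_local - y) / len(y)   # local least-squares gradient
            w_local -= lr * grad
        deltas.append(w_local - w_global)
    return w_global + np.mean(deltas, axis=0)          # the only communication step

# Toy usage: trade more local_steps for fewer communication rounds.
rng = np.random.default_rng(2)
X, y = rng.normal(size=(128, 5)), rng.normal(size=128)
shards = list(zip(np.split(X, 4), np.split(y, 4)))
w = np.zeros(5)
for _ in range(20):
    w = local_update_round(w, shards, local_steps=10)
```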
Stochastic Weight Averaging (SWA)
Distinct from FGE and snapshot ensembling, SWA does not use an ensemble of models. Instead, SWA averages the weights from various solutions to obtain a single final model. The method is built on the intuition that the weights reached at the end of each SGD cycle accumulate on the border of a region where the loss value is low. It then follows that by averaging them we can obtain parameter values corresponding to the lowest (or nearly lowest) point of the loss basin. Reaching this solution in turn leads to better generalization properties. This is shown effectively in visual form for a convex objective in the figure below [13]; the original work can be found in [15]. While this approach addresses most of the limitations described above, it can be further improved in terms of training speed, and below we discuss in detail a parallelized refinement of this algorithm that achieves both good training speed and good generalizability.
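A minimal sketch of the weight-averaging idea follows (illustrative NumPy, not the authors' implementation; the class name is ours): the weights sampled at the end of each learning-rate cycle are folded into a running average.

```python
import numpy as np

class WeightAverager:
    """Running average of model weights sampled at the ends of learning-rate cycles."""
    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.avg is None:
            self.avg = weights.copy()
        else:
            self.avg += (weights - self.avg) / self.count   # incremental mean

# Usage: call update(current_weights) whenever a cycle ends;
# averager.avg then holds the averaged (SWA-style) solution.
averager = WeightAverager()
for cycle_end_weights in [np.array([1.0, 2.0]), np.array([3.0, 0.0])]:
    averager.update(cycle_end_weights)
print(averager.avg)   # -> [2., 1.]
```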
Stochastic Weight Averaging in Parallel (SWAP)
As discussed earlier, the issue with small-batch (SB) SGD is the large number of updates, and hence the large amount of communication between workers, required to train the model. This problem can be tackled with LB-SGD, which generally gives more accurate gradient estimates, requires fewer updates, and therefore has good overall scaling behavior. However, as already stated, LB-SGD is reported to have poor generalization performance [16]. The SWAP process was therefore developed to achieve good generalization performance by averaging model weights [17]. With this approach, the model is first trained with large batches; once it is near convergence, the model is refined by independent SB-SGD across many workers. The study observes that this method achieves the same generalization performance as models trained solely with small batches, while requiring less communication and making better use of computing resources. The generalization performance of SWAP is tested on popular computer vision datasets such as CIFAR-10, CIFAR-100, and ImageNet, where it achieves generalization performance comparable to SB models in a training time similar to that of LB models.
The algorithm is broken down into three phases:
- Phase 1: All workers train a single model by computing large mini-batch updates and using high learning rates.
- Phase 2: Each worker independently refines the model using SB, low learning rate and a random subset of data.
- Phase 3: The model parameters from the workers are then averaged and batch-normalized to produce the final model.
We break down these three phases in simple terms in the following subsections.
Phase 1: Large batch step
The first phase consists of parallelizing SGD across all workers using large mini-batch updates in order to train a single model. Since all the worker machines need to work on training the same model, they synchronize at the end of every iteration. The learning rate used in this phase is higher, to promote full exploration of the loss landscape. This phase is terminated before zero training loss (error) is achieved.
Stopping training early in this manner allows us to prevent the optimizer from getting confined to spaces with very small gradients, so that phase 2 is able to build upon the performance obtained in phase 1. In practice, how early we stop phase one training (the optimal training termination accuracy) is a tunable hyper-parameter.
To promote full exploration of the loss landscape and avoid prematurely converging on a sharp minimum, a variable learning rate is employed: the learning rate fluctuates between two boundary values instead of decreasing monotonically during training. The advantage of this is two-fold: it removes the need to experiment with multiple learning rates and schedules to find what works best in practice, and it has been found in many cases to lead to convergence in fewer iterations [18]. In this case, cycles of 10 epochs were used, and one model was sampled from the end of each epoch, giving 8 models in total.
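A minimal sketch of a cyclical learning rate that oscillates between two bounds, in the spirit of [18], is given below; the bounds and cycle length are placeholders, not the values used in the SWAP experiments.

```python
def triangular_lr(iteration, step_size, lr_min, lr_max):
    """Cyclical learning rate: rises linearly from lr_min to lr_max over
    `step_size` iterations, falls back to lr_min, and repeats."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)   # position within the cycle, in [0, 1]
    return lr_min + (lr_max - lr_min) * (1.0 - x)

# Example: a cycle spanning 1000 iterations (500 up, 500 down).
lrs = [triangular_lr(t, step_size=500, lr_min=0.01, lr_max=0.1) for t in range(2000)]
```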
Phase 2: Small batch step
Phase 2 of the training process involves taking the synchronized model obtained with large batches in phase 1, distributing it to all of the workers, and refining it on each worker with a smaller batch size. The learning rate is steadily decreased in a linear fashion. This is done because the increased stochasticity observed in the SB phase can cause the optimizer to substantially overshoot the targeted minimum as it approaches it; decreasing the learning rate as each model nears the locally optimal value better ensures convergence during training. The rationale here follows a similar philosophy to the adaptive learning rates commonly used across varied implementations of gradient descent [19].
Once each of the worker models has been trained, the resulting sets of tunable model parameters can be said to lie on the fringes of the same basin in which the loss is locally optimized. To understand why these solutions end up where they do, we can leverage the work presented in [20], which shows that SGD with a constant learning rate is equivalent to sampling from a Gaussian distribution centered on the target minimum, with a covariance that depends on the learning rate. Relying on this understanding, we can interpret the individual solutions proposed by SGD as lying on a sphere whose interior contains the local minimum, inaccessible to vanilla mini-batch SGD.
Phase 3: Averaging step
In phase 3, the sets of worker model weights, which are in effect assumed to be evenly distributed around the target minimum, are averaged. The averaging and batch normalization in this step allow the interior of the sphere described in phase 2 to be reached. This results in a single proposed model whose test performance typically exceeds that of the individual worker models.
Method Summary
In effect, SWAP parallelizes the gradient calculation in SGD using large batches to first find an approximate solution to the optimization problem posed by DNN training. This approximate solution is then refined independently by a number of workers to obtain a set of n refined solutions (where n is the number of workers). These n solutions are then averaged to produce a final model with a lower realized loss than that of any of the n refined models.
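Putting the three phases together, here is a highly simplified toy sketch of the training flow on a least-squares problem. This is our paraphrase of the algorithm, not the authors' implementation: the function name, hyper-parameters, and the use of single-sample updates in phase 2 are all illustrative.

```python
import numpy as np

def swap_train(shards, dim, lb_iters=200, sb_iters=200, lb_lr=0.1, sb_lr=0.01):
    """Toy SWAP flow: (1) synchronized large-batch training across all shards,
    (2) independent small-batch refinement on each shard, (3) weight averaging."""
    w = np.zeros(dim)

    # Phase 1: large-batch step -- every update uses gradients from all workers.
    for _ in range(lb_iters):
        grads = [X.T @ (X @ w - y) / len(y) for X, y in shards]  # map (parallel)
        w -= lb_lr * np.mean(grads, axis=0)                      # reduce (synchronized)

    # Phase 2: small-batch step -- each worker refines its own copy independently.
    refined = []
    for X, y in shards:
        w_local, rng = w.copy(), np.random.default_rng(0)
        for _ in range(sb_iters):
            i = rng.integers(len(y))                             # one sample as a "small batch"
            w_local -= sb_lr * (X[i] @ w_local - y[i]) * X[i]
        refined.append(w_local)

    # Phase 3: average the refined weights (for a real network, batch-norm
    # statistics would also be recomputed here).
    return np.mean(refined, axis=0)

# Toy usage: 4 workers, each holding a shard of a synthetic regression problem.
rng = np.random.default_rng(1)
X, w_true = rng.normal(size=(400, 5)), rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=400)
shards = list(zip(np.split(X, 4), np.split(y, 4)))
w_swap = swap_train(shards, dim=5)
```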
Model performance on commonly studied vision data-sets
The authors evaluate the performance of SWAP on image classification, using ResNet-9 as the deep learning model. Since deep learning is data intensive and often requires millions of carefully chosen images for training [21], a few datasets are popularly used for tasks such as image classification and object segmentation. CIFAR-10 and CIFAR-100 each consist of 60,000 tiny images, belonging to 10 classes for CIFAR-10 and 100 classes for CIFAR-100 [22]. These are among the most popular datasets for machine learning research, as they enable quick, direct comparison of algorithms and expose the strengths and weaknesses of a particular architecture without an unreasonable computational burden for training and hyperparameter tuning [23]. ImageNet consists of 14 million hand-annotated images; object-level annotations provide a bounding box around the indicated object [23].
CIFAR 10 [24]
The SWAP algorithm was evaluated on the CIFAR-10 dataset with a batch size of 4096 for phase 1, which was terminated at an accuracy of 98%. Training was parallelized over 8 GPUs (512 samples per GPU). Phase 2 of SWAP was implemented across 8 workers on 1 GPU, with a batch size of 512 samples distributed across the workers. Phase 1 was carried out for 150 epochs and phase 2 for 100 epochs.
The table below compares the best test accuracies and the training times for each of the models (small-batch, large-batch, SWAP before averaging, SWAP after averaging).
We see that SWAP testing accuracies significantly increase when using the averaging technique.
CIFAR 100 [24]
For CIFAR 100, SWAP phase 1 is implemented with 2048 samples per batch across 8 GPUs and phase 2 is implemented with batch size 128 on 1 GPU (with 8 workers). Phase 1 is carried out for 150 epochs and phase 2 for 10 epochs. The LB and SB models were both trained for 150 epochs each.
While SB models trained on CIFAR-10 and CIFAR-100 achieve slightly higher test accuracy, they require far longer to reach full convergence.
ImageNet [25]
The small-batch experiments were trained on ImageNet for 28 epochs with a batch size of 256 samples on 8 Tesla V100 GPUs. The large-batch experiments were completed with double the batch size (512 samples) using 16 Tesla V100 GPUs. For SWAP phase 1, the large-batch settings (512 samples) were used for 22 epochs of training. For phase 2, two independent workers were used, each having 8 GPUs using a batch size of 256 samples for 6 epochs.
Identified areas of future improvements
The authors of the work identified a number of promising future directions including implementing any of the following approaches during SWAP training:
- Layer-wise Adaptive Rate Scaling (LARS) [26]. LARS uses a separate learning rate for each layer rather than for each weight, and the magnitude of the update is controlled with respect to the weight norm. Adopting LARS in SWAP phase 1 could lead to a better approximation of the gradient by LB-SGD, which would then reduce the training required in phase 2 (weight refinement).
- Mixed-precision training [27]. This method enables training DNNs with half-precision floating point numbers. It is promising for both phase 1 and phase 2 because it nearly halves memory requirements and, on suitable GPUs, can significantly speed up arithmetic during training.
- Post-local SGD [28]. This is an optimization scheme in which the model is first trained with standard SGD and then further refined using local SGD, a distributed training technique that runs SGD independently in parallel on different workers and averages the models periodically. The main difference between post-local SGD and SWAP lies in the phase-2 refinement: the number of refinement iterations for post-local SGD is on the order of tens, whereas for SWAP it is on the order of thousands.
- NovoGrad [29]. This is an adaptive SGD method that leverages layer-wise gradient normalization and decoupled weight decay for superior performance in a LB setting, with roughly half the memory demand of other adaptive optimizers. Adopting this method could speed up SWAP training in both phase 1 and phase 2.
We propose that the work presented here could be extended in scope by considering the following:
- Principled hyper-parameter tuning: presently, grid search is used to find the optimal stopping point for LB-SGD; future work could involve a principled method for finding this transition point.
- Introducing sparsity in the neural network: introducing sparsity by forcing a portion of the weights to zero, for example via L1 regularization, could further increase computational efficiency when combined with SWAP, without loss of accuracy.
Summary
In summation, the study presented in [17] shows that with the SWAP algorithm, good generalizability can be achieved even while using large mini-batch sizes for a substantial portion of training. Our blog introduces the research challenge of gradient descent optimization and the generalizability of deep learning models. We summarize the research presented in [17] and discuss related studies that also attempt to mitigate the trade-off between communication and computation when optimizing deep learning models on large datasets. Key research areas that could improve upon the state of the art are also identified and discussed.
References
- Reference 1: S-SGD: Symmetrical Stochastic Gradient Descent for Reaching Flat Minima in Deep Neural Network Training
- Reference 2: Convolutional Neural Network and Convex Optimization
- Reference 3: Revisiting Small Batch Training for Deep Neural Networks
- Reference 4: On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
- Reference 5: On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
- Reference 6: Parallelized Stochastic Gradient Descent
- Reference 7: The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
- Reference 8: On the Generalization Benefit of Noise in Stochastic Gradient Descent
- Reference 9: Distributed optimization
- Reference 10: Parallel SGD: When does averaging help?
- Reference 11: Snapshot Ensembles: Train 1, Get M for Free
- Reference 12: Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
- Reference 13: Stochastic Weight Averaging — a New Way to Get State of the Art Results in Deep Learning
- Reference 14: CoCoA: A General Framework for Communication-Efficient Distributed Optimization
- Reference 15: Averaging Weights Leads to Wider Optima and Better Generalization
- Reference 16: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- Reference 17: Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
- Reference 18: Cyclical Learning Rates for Training Neural Networks
- Reference 19: Neural Networks with Adaptive Learning Rate and Momentum Terms
- Reference 20: Stochastic Gradient Descent as Approximate Bayesian Inference
- Reference 21: Open Source Datasets for Computer Vision
- Reference 22: 80 Million Tiny Images
- Reference 23: Open Source Computer Vision Datasets
- Reference 24: Learning Multiple Layers of Features from Tiny Images
- Reference 25: ImageNet
- Reference 26: Large Batch Training of Convolutional Networks
- Reference 27: Mixed Precision Training
- Reference 28: Don’t Use Large Mini-Batches, Use Local SGD
- Reference 29: Training Deep Networks with Stochastic Gradient Normalized by Layer-wise Adaptive Second Moments