Behavior of Mini-Batch Optimization for Training Deep Neural Networks on Large Datasets
Keywords: Stochastic gradient descent, large datasets, convex optimization, parallelized computation, deep neural networks
Abstract: Stochastic Weight Averaging in Parallel (SWAP) is a method for training deep neural networks on large datasets with large mini-batch sizes without sacrificing generalization. The algorithm first uses large mini-batches to quickly compute an approximate set of model weights, then refines those weights through several small mini-batch training runs executed in parallel, and takes the average of the refined weights as the final model. This post summarizes the paper that introduces SWAP and explains the related and foundational concepts on which it builds, including convexity, generalization, and gradient descent. Related approaches that aim to achieve good generalization with large mini-batches, such as ensembles of model parameters and local updating methods, are also discussed. The performance of SWAP is reported for image classification with deep learning models on popular computer vision benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet. Further possible improvements identified by the paper's authors are elaborated upon, and additional future directions are identified and explained by the post authors.
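To make the two-phase structure described in the abstract concrete, the following is a minimal sketch of the SWAP idea on a toy least-squares problem, not the authors' implementation: phase one runs SGD with a large mini-batch to reach approximate weights quickly, phase two refines those weights with several independent small mini-batch runs (executed sequentially here for simplicity rather than in parallel) and averages the results. The function and variable names (sgd, n_workers, w_approx) and all hyperparameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=4096)

def sgd(w, batch_size, steps, lr):
    # Plain mini-batch SGD on the least-squares loss 0.5 * ||X w - y||^2 / batch_size.
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch_size)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w = w - lr * grad
    return w

# Phase 1: fast progress toward approximate weights with a large mini-batch.
w_approx = sgd(np.zeros(10), batch_size=1024, steps=200, lr=0.05)

# Phase 2: refine the approximate weights with small mini-batches
# (in SWAP these runs happen on parallel workers), then average.
n_workers = 4
refined = [sgd(w_approx.copy(), batch_size=32, steps=200, lr=0.01)
           for _ in range(n_workers)]
w_final = np.mean(refined, axis=0)

print("error of averaged weights:", np.linalg.norm(w_final - w_true))

In the actual method the phases train deep networks with learning-rate schedules, and the small-batch refinements run as genuinely parallel workers whose weight average is taken as the final model.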
ICLR Paper: https://arxiv.org/pdf/2001.02312.pdf