VARIATIONAL STOCHASTIC GRADIENT DESCENT

19 Apr 2024 (modified: 16 Feb 2016) · ICLR 2016 workshop submission · Readers: Everyone
Abstract: In the Bayesian approach to probabilistic modeling of data, we select a model for the data probabilities that depends on a continuous vector of parameters. For a given data set, Bayes' theorem yields a probability distribution over the model parameters, and predictions for new data are then obtained by averaging over this parameter distribution, which is an intractable problem. In this paper we propose to use Variational Bayes (VB) to estimate a Gaussian posterior of the model parameters for a given Gaussian prior, with Bayesian updates in a form that resembles SGD rules. We show that, with incremental posterior updates over a selected sequence of data points and a given number of iterations, the variational approximations trace a trajectory in the space of Gaussian parameters whose starting point is defined by the priors of the parameter distribution, which are the true hyper-parameters. The same priors provide weight decay, or L2 regularization, for the training. A choice of L2 regularization parameters and number of iterations therefore completely defines a learning rule for VB SGD optimization, unlike other methods with momentum (Duchi et al., 2011; Kingma & Ba, 2014; Zeiler, 2012), which require separately selecting learning rates, regularization rates, etc. We consider the application of VB SGD to the important practical case of fast training of neural networks on very large data sets. The speedup is achieved by partitioning the data and training in parallel, and the resulting set of solutions obtained with VB SGD forms a Gaussian mixture. By applying VB SGD optimization to this Gaussian mixture, we can merge multiple neural networks of the same dimensions into a single new neural network that has almost the same performance as the original Gaussian mixture.
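
The following is a minimal Python sketch, not the authors' implementation, illustrating the connection stated above between the Gaussian prior and L2 regularization / weight decay in an SGD-style update; the function name vb_sgd_step, the toy Gaussian-mean problem, and all hyper-parameter values are assumptions made for illustration only.

import numpy as np

# Hypothetical sketch: the gradient of -log N(w | mu0, sigma0_sq) is
# (w - mu0) / sigma0_sq, so a Gaussian prior adds a weight-decay (L2) term
# with coefficient 1 / sigma0_sq to an SGD-style update on the
# negative log-likelihood.
def vb_sgd_step(w, grad_nll, mu0=0.0, sigma0_sq=10.0, lr=0.05):
    grad_prior = (w - mu0) / sigma0_sq   # L2 / weight-decay term from the prior
    return w - lr * (grad_nll + grad_prior)

# Toy usage: incrementally estimate the mean of Gaussian data, one point
# at a time, as in the incremental posterior updates described above.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)
w = 0.0
for x in data:
    grad_nll = w - x                     # gradient of 0.5 * (w - x)**2
    w = vb_sgd_step(w, grad_nll)
print(f"regularized estimate of the mean: {w:.3f}")

Here the prior mean mu0 and variance sigma0_sq are the only regularization hyper-parameters, and the resulting estimate is shrunk slightly from the sample mean (about 2.0) toward mu0.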
Conflicts: InvenSense.com
