Scalable Stochastic Gradient Riemannian Langevin Dynamics in Non-Diagonal Metrics
Abstract: Stochastic-gradient sampling methods are often used to perform Bayesian inference on neural networks. Methods that incorporate notions of differential geometry have been observed to perform better, with the Riemannian metric improving posterior exploration by accounting for the local curvature. However, existing methods often resort to simple diagonal metrics to remain computationally efficient, which loses some of these gains. We propose two non-diagonal metrics that can be used in stochastic-gradient samplers to improve convergence and exploration while incurring only a minor computational overhead over diagonal metrics. We show that for fully connected neural networks (NNs) with sparsity-inducing priors and convolutional NNs with correlated priors, using these metrics can provide improvements. For some other choices, the posterior is easy enough that the simpler metrics also suffice.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: This revision addresses the comments of all reviewers. Most of the changes are clarifications related to specific comments. In addition, the revised paper now:
- Explains how the Monge metric differs from the empirical estimates of the Fisher information
- Includes additional experimental results (in the Appendix) in the form of Expected Calibration Error (ECE) and confidence
- Makes the conclusions regarding the value of the Monge metric clearer

Due to the changes, the main content is now slightly longer than 12 pages; if needed, we can still compress the paper back to 12 pages of main content.
---
This revision includes fixes and clarifications, mainly addressing the comments of reviewer 3sW2, while also improving the wording when introducing pSGLD and fixing the expression of the empirical Fisher in the Appendix, etc. The major changes based on the comments of reviewer 3sW2 are listed below. The revised paper now:
- Incorporates the clarifications regarding the roles of the theorems and the description of ECE
- Improves the discussion of the connections between the metrics and the Fisher information
- Improves the reasoning for using log-likelihood as an evaluation measure
---
This revision concerns improvements to Appendix 4 and Appendix 5. We add a detail about the employed RMSprop metric to Appendix 5, and fix multiple issues in Appendix 4 and Appendix 5.
---
Camera-ready revision. De-anonymized the paper. Added acknowledgements and the link to the GitHub repository containing the code, etc. Toned down the wording regarding the comparison against previous approaches and made minor fixes and improvements to wording, etc. Improved the code with better descriptions, etc.
Supplementary Material: zip
Assigned Action Editor: ~Jasper_Snoek1
Submission Number: 1174