Adding Gradient Noise Improves Learning for Very Deep Networks

Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Lukasz Kaiser, Karol Kurach, Ilya Sutskever, James Martens

Nov 04, 2016 (modified: Dec 20, 2016) ICLR 2017 conference submission readers: everyone
  • Abstract: Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we explore the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which we find surprisingly effective when training these very deep architectures. Unlike classical weight noise, gradient noise injection is complementary to advanced stochastic optimization algorithms such as Adam and AdaGrad. The technique not only helps to avoid overfitting, but also can result in lower training loss. We see consistent improvements in performance across an array of complex models, including state-of-the-art deep networks for question answering and algorithm learning. We observe that this optimization strategy allows a fully-connected 20-layer deep network to escape a bad initialization with standard stochastic gradient descent. We encourage further application of this technique to additional modern neural architectures.
  • TL;DR: Adding annealed Gaussian noise to the gradient improves training of neural networks in ways complementary to adaptive learning algorithms and the noise introduced by SGD.
  • Conflicts: cs.umass.edu, google.com, openai.com, cs.toronto.edu
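The technique summarized in the abstract, adding annealed Gaussian noise to each gradient before the parameter update, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The abstract does not specify a noise schedule, so the decaying variance `eta / (1 + step) ** gamma`, the default values of `eta` and `gamma`, and the function names `noise_std`, `noisy_gradient`, and `sgd_with_gradient_noise` are all assumptions made for this example.

```python
import numpy as np


def noise_std(step, eta=0.3, gamma=0.55):
    """Standard deviation of the gradient noise at a training step.

    Assumed schedule: variance = eta / (1 + step) ** gamma, so the noise
    is largest at the start of training and anneals toward zero.
    """
    return np.sqrt(eta / (1.0 + step) ** gamma)


def noisy_gradient(grad, step, rng, eta=0.3, gamma=0.55):
    """Return the gradient plus zero-mean Gaussian noise with annealed variance."""
    sigma = noise_std(step, eta, gamma)
    return grad + rng.normal(0.0, sigma, size=grad.shape)


def sgd_with_gradient_noise(grad_fn, w0, lr=0.1, steps=500, seed=0):
    """Plain SGD where every gradient is perturbed before the update."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for t in range(steps):
        g = noisy_gradient(grad_fn(w), t, rng)
        w -= lr * g
    return w


# Toy problem (hypothetical): f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_final = sgd_with_gradient_noise(lambda w: w, w0=[5.0, -3.0])
```

Because the noise is added to the gradient rather than to the weights, the same perturbation step could in principle be placed in front of an adaptive optimizer such as Adam or AdaGrad instead of the plain SGD update used here.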
