Understanding Very Deep Networks via Volume Conservation

Thomas Unterthiner, Sepp Hochreiter

Feb 18, 2016 (modified: Feb 18, 2016) ICLR 2016 workshop submission
  • Abstract: Recently, very deep neural networks have set new records across many application domains, such as Residual Networks on the ImageNet challenge and Highway Networks on language processing tasks. We expect further excellent performance improvements in different fields from these very deep networks. However, these networks are still poorly understood, especially since they rely on non-standard architectures. In this contribution we analyze the learning dynamics required to successfully train very deep neural networks. For the analysis we use a symplectic network architecture which inherently conserves volume when mapping a representation from one layer to the next. It therefore avoids the vanishing gradient problem, which in turn allows thousands of layers to be trained effectively. We consider highway and residual networks as well as the LSTM model, all of which have approximately volume-conserving mappings. We identify two important factors for making deep architectures work: (1) (near) volume-conserving mappings of the form $x \leftarrow x + f(x)$ or similar (cf.\ avoiding the vanishing gradient); (2) controlling the drift effect, which increases or decreases $x$ during propagation toward the output (cf.\ avoiding bias shifts).
  • Conflicts: jku.at
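
A volume-conserving mapping is one whose Jacobian has determinant of magnitude 1, so no direction in representation space is squashed or stretched during propagation. As a minimal sketch (not the paper's actual architecture), an additive coupling layer illustrates this property: half of the input passes through unchanged while the other half is shifted by a function of the first half, which makes the Jacobian triangular with unit diagonal and hence exactly volume conserving. The function names and the `tanh` coupling below are illustrative choices, not taken from the paper:

```python
import numpy as np

def coupling(x, W):
    # Additive coupling layer: x1 passes through unchanged,
    # x2 is shifted by a nonlinear function of x1.
    # The Jacobian is block-triangular with an identity diagonal,
    # so det(J) = 1 exactly: the mapping conserves volume.
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    return np.concatenate([x1, x2 + np.tanh(W @ x1)])

def jacobian(f, x, eps=1e-6):
    # Numerical Jacobian via central finite differences.
    n = len(x)
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((2, 2))
J = jacobian(lambda v: coupling(v, W), x)
print(np.linalg.det(J))  # ~1.0, up to finite-difference error
```

Because $|\det J| = 1$ at every layer, the product of Jacobians across thousands of layers cannot collapse to zero volume, which is the intuition behind why such architectures avoid vanishing gradients.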