Decoupling the Layers in Residual Networks

Anonymous

Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission readers: everyone
  • Abstract: We propose the Warped Residual Network, which uses a parallelizable warp operator for forward and backward propagation to distant layers and trains faster than the original residual network. We apply perturbation theory to residual networks and decouple the interactions between residual units. The resulting warp operator is a first-order approximation of the output over multiple layers (a sketch of this operator is given after the keywords below). The first-order perturbation theory exhibits properties such as binomial path lengths and exponential gradient scaling found experimentally by Veit et al. (2016). We show that Warped Residual Networks learn invariant models by breaking the redundancy in the weights caused by local symmetries in the input. The proposed network outperforms or matches the predictive performance of the original residual network with the same number of parameters, while achieving a significant speedup in training time. We find that as the layers get wider, the speedup in training time (44% with the widest architecture) approaches the optimal speedup of 50% for skipping over one residual unit. Our architecture opens up a new research direction for methods that train a K-layer ResNet in O(1) time, as opposed to the O(K) time of standard forward and backward propagation.
  • TL;DR: We propose the Warped Residual Network, which uses a parallelizable warp operator for forward and backward propagation to distant layers and trains faster than the original residual network.
  • Keywords: Warped residual networks, residual networks, symmetry breaking
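A minimal sketch of the warp idea described in the abstract, assuming the standard residual recursion x_{k+1} = x_k + F_k(x_k): expanding F_{k+1} to first order around x_k gives x_{k+2} ≈ x_k + F_k(x_k) + F_{k+1}(x_k) + J_{F_{k+1}}(x_k) F_k(x_k), so both residual branches are evaluated at the same input and can be computed in parallel. The residual-branch functions F1/F2, the weights, and the shapes below are hypothetical illustrations, not the paper's implementation; JAX is used only to compute the Jacobian-vector product.

    # Sketch: first-order warp over two residual units (hypothetical branches F1, F2).
    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    k1, k2, kx = jax.random.split(key, 3)
    d = 8
    W1 = jax.random.normal(k1, (d, d)) * 0.1   # hypothetical weights for unit k
    W2 = jax.random.normal(k2, (d, d)) * 0.1   # hypothetical weights for unit k+1
    x = jax.random.normal(kx, (d,))

    def F1(x):
        # residual branch of unit k (assumed form, for illustration only)
        return jnp.tanh(W1 @ x)

    def F2(x):
        # residual branch of unit k+1 (assumed form, for illustration only)
        return jnp.tanh(W2 @ x)

    def two_units_exact(x):
        # exact, sequential forward pass through two residual units
        x1 = x + F1(x)
        return x1 + F2(x1)

    def two_units_warp(x):
        # first-order warp: F1 and the JVP of F2 both take the same input x,
        # so they can be evaluated in parallel instead of sequentially
        f1 = F1(x)
        f2, jvp = jax.jvp(F2, (x,), (f1,))   # f2 = F2(x), jvp = J_{F2}(x) @ f1
        return x + f1 + f2 + jvp

    # The difference is the second-order remainder of the Taylor expansion,
    # which is small for this toy example.
    print(jnp.max(jnp.abs(two_units_exact(x) - two_units_warp(x))))

The sketch only illustrates the forward approximation; the paper additionally applies the same decoupling to backward propagation and to skipping over whole residual units during training.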
