Decoupling the Layers in Residual Networks

Ricky Fok, Aijun An, Zana Rashidi, Xiaogang Wang

Feb 15, 2018 (modified: Oct 21, 2017) ICLR 2018 Conference Blind Submission
  • Abstract: We propose a Warped Residual Network (WarpNet) that uses a parallelizable warp operator for forward and backward propagation to distant layers, and trains faster than the original residual network. We apply perturbation theory to residual networks and decouple the interactions between residual units. The resulting warp operator is a first-order approximation of the output over multiple layers. The first-order perturbation theory exhibits properties such as binomial path lengths and exponential gradient scaling, as found experimentally by Veit et al. (2016). We demonstrate through an extensive performance study that the proposed network achieves predictive performance comparable to the original residual network with the same number of parameters, while significantly reducing total training time. Because WarpNet performs model parallelism during training, distributing weights over different GPUs, it offers both a speed-up and the ability to train larger networks than the original residual network.
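The idea of a first-order warp operator can be illustrated with a toy sketch. Assuming residual units of the form x_{k+1} = x_k + F_k(x_k), expanding F_2(x + F_1(x)) to first order around x gives x_2 ≈ x + F_1(x) + F_2(x) + (dF_2/dx)·F_1(x), in which F_1(x) and F_2(x) both depend only on x and can therefore be computed in parallel. The functions `F1`, `F2`, and the finite-difference `jvp` below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

# Toy residual branches with small random weights (the paper uses
# convolutional blocks; this is only an illustrative sketch).
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(4, 4))
W2 = rng.normal(scale=0.01, size=(4, 4))

def F1(x):
    return np.tanh(W1 @ x)

def F2(x):
    return np.tanh(W2 @ x)

def jvp(F, x, v, eps=1e-6):
    # Finite-difference Jacobian-vector product: (dF/dx at x) @ v
    return (F(x + eps * v) - F(x)) / eps

x = rng.normal(size=4)

# Exact two-unit residual forward pass:
#   x1 = x + F1(x);  x2 = x1 + F2(x1)
x1 = x + F1(x)
exact = x1 + F2(x1)

# First-order warp operator: expand F2(x + F1(x)) around x.
# F1(x) and F2(x) depend only on x, so they can run in parallel.
warp = x + F1(x) + F2(x) + jvp(F2, x, F1(x))

print(np.max(np.abs(exact - warp)))  # error is second order in |F1(x)|
```

The approximation error is second order in the size of the residual branch output, which is why the decoupling works well when the residual corrections are small relative to the identity path.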
  • TL;DR: We propose the Warped Residual Network using a parallelizable warp operator for forward and backward propagation to distant layers that trains faster than the original residual neural network.
  • Keywords: Warped residual networks, residual networks
