Decoupling the Layers in Residual Networks
Nov 03, 2017 (modified: Nov 03, 2017) · ICLR 2018 Conference Blind Submission
Abstract: We propose the Warped Residual Network, which uses a parallelizable warp operator for forward and backward propagation to distant layers and trains faster than the original residual network. We apply perturbation theory to residual networks and decouple the interactions between residual units. The resulting warp operator is a first-order approximation of the output over multiple layers. The first-order perturbation theory exhibits properties such as binomial path lengths and exponential gradient scaling, found experimentally by Veit et al. (2016). We show that Warped Residual Networks learn invariant models by breaking the redundancy in the weights caused by local symmetries in the input. The proposed network outperforms or achieves predictive performance comparable to the original residual network with the same number of parameters, while achieving a significant speedup in training time. We find that as the layers get wider, the speedup in training time (44% with the widest architecture) approaches the optimal speedup of 50% for skipping over one residual unit. Our architecture opens up a new research direction for methods to train K-layer ResNets in O(1) time, as opposed to the O(K) time of standard forward and backward propagation.
TL;DR: We propose the Warped Residual Network using a parallelizable warp operator for forward and backward propagation to distant layers that trains faster than the original residual neural network.
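The first-order idea behind the warp operator can be sketched in a few lines. The sketch below is illustrative and not the paper's implementation: a standard residual stack applies units sequentially as x_{k+1} = x_k + F_k(x_k), while the first-order approximation evaluates every residual function at the same input, so the K evaluations are independent (and hence parallelizable). The function names and the linear toy units are assumptions for illustration.

```python
import numpy as np

def sequential_resnet(x, units):
    """Exact forward pass: x <- x + F_k(x), one unit after another."""
    for f in units:
        x = x + f(x)
    return x

def warp_operator(x, units):
    """First-order approximation: every unit sees the same input x,
    so the K evaluations can run in parallel before being summed."""
    return x + sum(f(x) for f in units)

# Toy example with small linear residual units F_i(v) = W_i @ v.
# Small weights keep the higher-order cross terms (dropped by the
# warp operator) negligible, so the approximation is close.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
units = [lambda v, W=0.01 * rng.standard_normal((4, 4)): W @ v
         for _ in range(2)]

exact = sequential_resnet(x, units)
approx = warp_operator(x, units)
print(np.allclose(exact, approx, atol=1e-3))
```

For a single residual unit the two forward passes coincide exactly; the approximation error comes only from the dropped higher-order terms such as F_1(F_0(x)), which shrink as the residual branches' outputs get smaller.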