- TL;DR: We characterize the stability and convergence of gradient descent for learning ResNet, unveiling the theoretical and practical importance of tau = 1/sqrt(L) in the residual block.
- Abstract: The ResNet architecture has achieved great success since its debut. In this paper, we study the stability of learning ResNet. Specifically, we consider the ResNet block $h_l = \phi(h_{l-1}+\tau\cdot g(h_{l-1}))$, where $\phi(\cdot)$ is the ReLU activation and $\tau$ is a scalar. We show that, for the standard initialization used in practice, $\tau = 1/\Omega(\sqrt{L})$ is a sharp threshold characterizing the stability of the forward and backward processes of ResNet, where $L$ is the number of residual blocks. Specifically, stability is guaranteed for $\tau\le 1/\Omega(\sqrt{L})$, while conversely the forward process explodes when $\tau>L^{-\frac{1}{2}+c}$ for a positive constant $c$. Moreover, if the ResNet is properly over-parameterized, we show that gradient descent is guaranteed to find a global minimum for $\tau \le 1/\tilde{\Omega}(\sqrt{L})$ (where $\tilde{\Omega}(\cdot)$ hides logarithmic factors), which significantly enlarges the range $\tau\le 1/\tilde{\Omega}(L)$ shown to admit global convergence in previous work. We also demonstrate that the over-parameterization requirement of ResNet depends only weakly on the depth, which corroborates the advantage of ResNet over vanilla feedforward networks. Empirically, with $\tau\le1/\sqrt{L}$, deep ResNets can be trained easily even without normalization layers. Moreover, adding $\tau=1/\sqrt{L}$ can also improve the performance of ResNets with normalization layers.
- Keywords: ResNet, stability, convergence theory, over-parameterization
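
The forward-stability claim in the abstract can be illustrated with a small numerical sketch. It stacks $L$ blocks $h_l = \phi(h_{l-1}+\tau\cdot g(h_{l-1}))$ at random initialization and reports how much the hidden-state norm grows. Several choices here are assumptions made for illustration and are not taken from the abstract: the residual branch is a single linear map $g(h) = Wh$ with i.i.d. Gaussian entries of variance $1/m$, the width is $m = 256$, the depth is $L = 100$, and the function name `forward_norm_ratio` is hypothetical.

```python
import numpy as np


def forward_norm_ratio(tau, depth=100, width=256, seed=0):
    """Return ||h_L|| / ||h_0|| for a randomly initialized stack of residual blocks."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    h /= np.linalg.norm(h)  # unit-norm input, so the final norm equals the ratio
    for _ in range(depth):
        # Assumed residual branch: g(h) = W h with W_ij ~ N(0, 1/width).
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        # One block: h_l = phi(h_{l-1} + tau * g(h_{l-1})), with phi = ReLU.
        h = np.maximum(h + tau * (W @ h), 0.0)
    return float(np.linalg.norm(h))


if __name__ == "__main__":
    L = 100
    for tau in (1.0 / np.sqrt(L), 1.0):
        ratio = forward_norm_ratio(tau, depth=L)
        print(f"tau={tau:.3f}  ||h_L||/||h_0|| = {ratio:.3e}")
```

Under these assumptions, the ratio should stay of order one for $\tau = 1/\sqrt{L}$ and grow by many orders of magnitude for $\tau = 1$. This toy experiment only probes the forward pass at initialization; it says nothing about the backward process or the convergence guarantee.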