Symmetry-Breaking Convergence Analysis of Certain Two-layered Neural Networks with ReLU nonlinearity

Yuandong Tian

Feb 17, 2017 (modified: Feb 17, 2017) ICLR 2017 workshop submission readers: everyone
  • Abstract: In this paper, we use dynamical systems to analyze the nonlinear weight dynamics of two-layered bias-free networks of the form $g(\vx; \vw) = \sum_{j=1}^K \sigma(\vw_j\trans\vx)$, where $\sigma(\cdot)$ is the ReLU nonlinearity. We assume that the input $\vx$ follows a Gaussian distribution. The network is trained by gradient descent to mimic the output of a teacher network of the same size with fixed parameters $\vw\opt$, using the $l_2$ loss. We first show that when $K = 1$, the nonlinear dynamics can be written in closed form and converge to $\vw\opt$ with probability at least $(1-\epsilon)/2$ if random weight initialization with a proper standard deviation ($\sim 1/\sqrt{d}$) is used, verifying empirical practice~\cite{xavier, PReLU,lecun2012efficient}. For networks with many ReLU nodes ($K \ge 2$), we apply our closed-form dynamics and prove that when the teacher parameters $\{\vw\opt_j\}_{j=1}^K$ form an orthonormal basis, (1) a symmetric weight initialization yields convergence to a saddle point and (2) a certain symmetry-breaking weight initialization yields global convergence to $\vw\opt$ without local minima. To our knowledge, this is the first proof of global convergence for a nonlinear neural network that does not rely on unrealistic assumptions about the independence of ReLU activations. In addition, we give a concise gradient update formulation for a multilayer ReLU network when it follows a teacher of the same size with the $l_2$ loss. Simulations verify our theoretical analysis.
  • TL;DR: We find a closed-form gradient formula for a two-layered ReLU network and apply it to convergence analysis
  • Keywords: Theory, Deep learning
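
The teacher-student setup described in the abstract can be sketched numerically. The following is a minimal illustration, not the paper's closed-form dynamics: it approximates the population gradient with fresh Gaussian mini-batches, and the function names (`relu`, `network`, `train_student`) as well as the step size, batch size, and number of steps are assumptions chosen for illustration only.

```python
import numpy as np

def relu(z):
    """ReLU nonlinearity sigma(z) = max(z, 0)."""
    return np.maximum(z, 0.0)

def network(X, W):
    """g(x; w) = sum_j relu(w_j^T x) for each row x of X; W has shape (K, d)."""
    return relu(X @ W.T).sum(axis=1)

def train_student(W_teacher, W_init, n_steps=500, batch=2000, lr=0.1, seed=0):
    """Gradient descent on the l2 loss between student and teacher outputs.

    Hypothetical helper: the expectation over Gaussian inputs is replaced
    by a sample average over a fresh batch drawn at every step.
    """
    rng = np.random.default_rng(seed)
    W = W_init.copy()
    d = W.shape[1]
    for _ in range(n_steps):
        X = rng.standard_normal((batch, d))        # inputs x ~ N(0, I)
        H = X @ W.T                                # student pre-activations, (batch, K)
        residual = relu(H).sum(axis=1) - network(X, W_teacher)
        # Gradient of 0.5 * mean(residual^2) with respect to each row w_j:
        # mean over the batch of residual * 1[w_j^T x > 0] * x.
        grad = (residual[:, None] * (H > 0)).T @ X / batch
        W -= lr * grad
    return W

if __name__ == "__main__":
    d = 5
    rng = np.random.default_rng(1)
    W_star = np.eye(d)[:1]                         # K = 1 teacher with unit norm
    W0 = W_star + 0.1 * rng.standard_normal((1, d))  # start near the teacher
    W = train_student(W_star, W0)
    print("distance to teacher:", np.linalg.norm(W - W_star))
```

The same function accepts $K \ge 2$ by passing a teacher whose rows are orthonormal (for example, several rows of `np.eye(d)`), which allows symmetric and symmetry-breaking initializations to be compared empirically under these assumptions.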