- TL;DR: This paper presents how the loss surfaces of nonlinear neural networks are substantially shaped by the nonlinearities in activations.
- Abstract: Understanding the loss surfaces of neural networks is fundamentally important to understanding deep learning. This paper presents how the nonlinearities in activations substantially shape the loss surfaces of neural networks. We first prove that the loss surface of every neural network has infinite spurious local minima, which are defined as the local minima with higher empirical risks than the global minima. Our result holds for any neural network with arbitrary depth and arbitrary piecewise linear activation functions (excluding linear functions) under most loss functions in practice. This result demonstrates that nonlinear networks possess substantial differences to the well-studied linear neural networks. Essentially, the underlying assumptions for the above result are consistent with most practical circumstances where the output layer is narrower than any hidden layer. We further prove a theorem that draws a big picture for the loss surfaces of nonlinear neural networks from the following respects. (1) Smooth and multilinear partition: the loss surface is partitioned into multiple smooth and multilinear open cells. (2) Local analogous convexity: within every cell, local minima are equally good, and equivalently, they are all global minima in the cell. (3) Local minima valley: some local minima are concentrated into a valley in some cell, sharing the same empirical risk. (4) Linear collapse: when all activations are linear, the partitioned loss surface collapses to one single cell, which includes linear neural networks as a simplified case. The second result holds for one-hidden-layer networks for regression under convex loss, while all others apply to networks of arbitrary depth.
- Keywords: neural network, nonlinearity, loss surface, spurious local minimum