Small nonlinearities in activation functions create bad local minima in neural networks

Chulhee Yun; Suvrit Sra; Ali Jadbabaie

Published: 21 Dec 2018, Last Modified: 05 May 2023 | ICLR 2019 Conference Blind Submission | Readers: Everyone

Abstract: We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with the "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic.

Keywords: spurious local minima, loss surface, optimization landscape, neural network

TL;DR: We constructively prove that even the slightest nonlinear activation functions introduce spurious local minima, for general datasets and activation functions.
