DEMYSTIFYING WIDE NONLINEAR AUTO-ENCODERS: FAST SGD CONVERGENCE TOWARDS SPARSE REPRESENTATION FROM RANDOM INITIALIZATION

Anonymous

Nov 07, 2017 (modified: Nov 07, 2017) ICLR 2018 Conference Blind Submission
  • Abstract: Auto-encoders are commonly used for unsupervised representation learning and for pre-training deeper neural networks. When the activation function is linear and the encoding dimension (the width of the hidden layer) is smaller than the input dimension, it is well known that the auto-encoder is optimized to learn the principal components of the data distribution (Oja, 1982). However, when the activation is nonlinear and the width is larger than the input dimension, the auto-encoder behaves differently from PCA and can capture multi-modal aspects of the input distribution. We provide a theoretical explanation for this empirically observed phenomenon when the rectified linear unit (ReLU) is adopted as the activation function and the hidden-layer width is set to be large. In this case, we show that, with significant probability, initializing the weight matrix of an auto-encoder by sampling from a spherical Gaussian distribution and then training with stochastic gradient descent (SGD) converges towards the ground-truth representation for a class of sparse dictionary learning models. In addition, we show that, conditioned on convergence, the expected convergence rate is O(1/t), where t is the number of updates. Our analysis quantifies how increasing the hidden-layer width helps training when random initialization is used, and how the norm of the network weights influences the speed of SGD convergence. A minimal illustrative sketch of this training setup is given after the keyword list below.
  • TL;DR: theoretical analysis of wide nonlinear auto-encoders
  • Keywords: stochastic gradient descent, autoencoders, nonconvex optimization, representation learning, theory
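As a concrete, purely illustrative sketch of the setup the abstract describes, the code below trains a wide, tied-weight ReLU auto-encoder with single-sample SGD from a spherical-Gaussian initialization on data drawn from a synthetic sparse dictionary model. The tied decoder, the bias term, and all dimensions and hyper-parameters are assumptions made for this sketch, not details taken from the paper.

# Minimal sketch (not the paper's code): wide ReLU auto-encoder trained by SGD
# from a random spherical-Gaussian initialization on sparse-dictionary data.
import torch

d, k, n = 32, 256, 10000           # input dim, hidden width (k >> d), #samples
sparsity = 3                       # non-zeros per code vector (assumed)

# Ground-truth dictionary with unit-norm columns.
A = torch.randn(d, k)
A /= A.norm(dim=0, keepdim=True)

# Sparse non-negative codes -> observed samples x = A h.
H = torch.zeros(k, n)
for j in range(n):
    idx = torch.randperm(k)[:sparsity]
    H[idx, j] = torch.rand(sparsity)
X = (A @ H).T                      # shape (n, d)

# Encoder weight sampled from a spherical Gaussian, as in the abstract.
W = (torch.randn(k, d) / d ** 0.5).requires_grad_()
b = torch.zeros(k, requires_grad=True)

opt = torch.optim.SGD([W, b], lr=1e-2)
for t in range(5000):
    x = X[torch.randint(n, (1,))]           # single-sample SGD update
    h = torch.relu(x @ W.T + b)             # encoder: ReLU(W x + b)
    x_hat = h @ W                           # tied-weight decoder: W^T h
    loss = 0.5 * ((x_hat - x) ** 2).sum()   # reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

The width k being much larger than d and the one-sample-per-step updates mirror the regime analyzed in the abstract; tracking how the rows of W align with the columns of A over training is one simple way to probe the claimed convergence towards the ground-truth dictionary.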
