- Abstract: Auto-encoders are commonly used for unsupervised representation learning and for pre-training deeper neural networks. When its activation function is linear and the encoding dimension (width of hidden layer) is smaller than the input dimension, it is well known that auto-encoder is optimized to learn the principal components of the data distribution (Oja1982). However, when the activation is nonlinear and when the width is larger than the input dimension (overcomplete), auto-encoder behaves differently from PCA, and in fact is known to perform well empirically for sparse coding problems. We provide a theoretical explanation for this empirically observed phenomenon, when rectified-linear unit (ReLu) is adopted as the activation function and the hidden-layer width is set to be large. In this case, we show that, with significant probability, initializing the weight matrix of an auto-encoder by sampling from a spherical Gaussian distribution followed by stochastic gradient descent (SGD) training converges towards the ground-truth representation for a class of sparse dictionary learning models. In addition, we can show that, conditioning on convergence, the expected convergence rate is O(1/t), where t is the number of updates. Our analysis quantifies how increasing hidden layer width helps the training performance when random initialization is used, and how the norm of network weights influence the speed of SGD convergence.
- TL;DR: theoretical analysis of nonlinear wide autoencoder
- Keywords: stochastic gradient descent, autoencoders, nonconvex optimization, representation learning, theory