DEMYSTIFYING WIDE NONLINEAR AUTO-ENCODERS: FAST SGD CONVERGENCE TOWARDS SPARSE REPRESENTATION FROM RANDOM INITIALIZATION
Nov 07, 2017 (modified: Nov 07, 2017) · ICLR 2018 Conference Blind Submission
Abstract: Auto-encoders are commonly used for unsupervised representation learning and for pre-training deeper neural networks. When the activation function is linear and the encoding dimension (the width of the hidden layer) is smaller than the input dimension, it is well known that the auto-encoder is optimized to learn the principal components of the data distribution (Oja, 1982). However, when the activation is nonlinear and the width is larger than the input dimension, the auto-encoder behaves differently from PCA, with the ability to capture multi-modal aspects of the input. We provide a theoretical explanation for this empirically observed phenomenon when the rectified linear unit (ReLU) is adopted as the activation function and the hidden-layer width is set to be large. In this case, we show that, with significant probability, initializing the weight matrix of the auto-encoder by sampling from a spherical Gaussian distribution and then training with stochastic gradient descent (SGD) converges towards the ground-truth representation for a class of sparse dictionary learning models. In addition, we show that, conditioned on convergence, the expected convergence rate is O(1/t), where t is the number of updates. Our analysis quantifies how increasing the hidden-layer width helps training performance when random initialization is used, and how the norm of the network weights influences the speed of SGD convergence.
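The following is a minimal sketch (not the authors' code) of the setup the abstract describes: data drawn from a sparse dictionary model, a wide one-hidden-layer ReLU auto-encoder whose weights are initialized from a spherical Gaussian, and plain SGD on the squared reconstruction error. The specific dimensions, the tied-weight decoder, the non-negative sparse code distribution, and the learning rate are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of a wide ReLU auto-encoder trained by SGD on data from a
# sparse dictionary model x = A h (h sparse and non-negative). All
# hyperparameters below are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(0)

d = 50          # input dimension
k = 100         # ground-truth dictionary size
m = 400         # hidden-layer width (wide: m > d)
sparsity = 3    # number of active dictionary atoms per sample
lr = 0.05       # SGD step size
steps = 20000

# Ground-truth dictionary with unit-norm columns (assumed data model).
A = rng.normal(size=(d, k))
A /= np.linalg.norm(A, axis=0, keepdims=True)

def sample_x():
    """Draw one sample from the sparse dictionary model x = A h."""
    h = np.zeros(k)
    support = rng.choice(k, size=sparsity, replace=False)
    h[support] = rng.uniform(0.5, 1.5, size=sparsity)  # sparse non-negative code
    return A @ h

# Spherical Gaussian initialization of the encoder weights.
W = rng.normal(size=(m, d)) / np.sqrt(d)
b = np.zeros(m)

relu = lambda z: np.maximum(z, 0.0)

for t in range(steps):
    x = sample_x()
    z = W @ x + b
    h_hat = relu(z)                 # hidden representation
    x_hat = W.T @ h_hat             # tied-weight decoder (one possible choice)
    err = x_hat - x                 # reconstruction error

    # Gradients of 0.5 * ||x_hat - x||^2 w.r.t. W and b (tied weights):
    # decoder term outer(h_hat, err) plus encoder term through the ReLU mask.
    mask = (z > 0).astype(float)
    grad_W = np.outer(h_hat, err) + np.outer(mask * (W @ err), x)
    grad_b = mask * (W @ err)

    W -= lr * grad_W
    b -= lr * grad_b

    if (t + 1) % 5000 == 0:
        print(f"step {t+1:6d}  squared error {err @ err:.4f}")
```

Under the paper's claim, with a sufficiently wide hidden layer this random-initialization-plus-SGD procedure converges towards the ground-truth dictionary representation, with expected rate O(1/t) conditioned on convergence.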
TL;DR: Theoretical analysis of wide nonlinear auto-encoders.
Keywords: stochastic gradient descent, autoencoders, nonconvex optimization, representation learning, theory