Sparse Coding and Autoencoders

Akshay Rangamani, Anirbit Mukherjee, Amitabh Basu, Ashish Arora, Tejaswini Ganapathi, Sang Peter Chin, Trac D. Tran

ISIT 2018 (modified: 08 Nov 2022)

Abstract: In this work we study the landscape of the squared loss of an Autoencoder when the data generative model is that of “Sparse Coding”/“Dictionary Learning”. The neural net considered is an $\mathbb{R}^{n}\rightarrow \mathbb{R}^{n}$ mapping and has a single ReLU activation layer of size $h > n$. The net has access to vectors $y\in \mathbb{R}^{n}$ obtained as $y=A^{\ast}x^{\ast}$, where $x^{\ast}\in \mathbb{R}^{h}$ are sparse high-dimensional vectors and $A^{\ast}\in \mathbb{R}^{n\times h}$ is an overcomplete incoherent matrix. Under very mild distributional assumptions on $x^{\ast}$, we prove that the norm of the expected gradient of the squared loss function is asymptotically (in the sparse code dimension) negligible for all points in a small neighborhood of $A^{\ast}$. This is supported with experimental evidence using synthetic data.
We conduct experiments suggesting that $A^{\ast}$ sits at the bottom of a well in the landscape, and we also give experiments showing that gradient descent on this loss function gets columnwise very close to the original dictionary even from a fairly distant initialization. Along the way we prove that a layer of ReLU gates can be set up to automatically recover the support of the sparse codes. Since this property holds independently of the loss function, we believe that it could be of independent interest. A full version of this paper is accessible at: https://arxiv.org/abs/1708.03735
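
The support-recovery claim can be illustrated with a small numerical sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the dictionary is the deterministic matrix $[I \mid H/\sqrt{n}]$ built from a Hadamard matrix $H$ (coherence exactly $1/\sqrt{n}$), the codes have exactly $k=2$ nonzero entries drawn from $[1, 2)$, the ReLU layer uses the weights $W = A^{\ast}$ with a single shared bias $-\epsilon$, and the decoder reuses the same weights. Under these assumptions an off-support pre-activation is at most $k\mu b$ and an on-support one is at least $a-(k-1)\mu b$, so any $\epsilon$ strictly between the two bounds separates them.

```python
import numpy as np

def hadamard_dictionary(log2_n):
    """Return the n x 2n matrix [I | H/sqrt(n)], H a Sylvester Hadamard matrix.

    Illustrative choice only: columns have unit norm and the largest inner
    product between two distinct columns is 1/sqrt(n).
    """
    H = np.array([[1.0]])
    for _ in range(log2_n):
        H = np.block([[H, H], [H, -H]])
    n = H.shape[0]
    return np.hstack([np.eye(n), H / np.sqrt(n)])

def sample_sparse_codes(rng, h, k, low, high, num):
    """Draw `num` codes in R^h, each with exactly k nonzeros in [low, high)."""
    X = np.zeros((num, h))
    for i in range(num):
        support = rng.choice(h, size=k, replace=False)
        X[i, support] = rng.uniform(low, high, size=k)
    return X

def relu_layer(W, eps, Y):
    """Hidden activations ReLU(W^T y - eps), one row per sample y in Y."""
    return np.maximum(Y @ W - eps, 0.0)

rng = np.random.default_rng(0)
A_star = hadamard_dictionary(6)            # n = 64, h = 128 (overcomplete)
n, h = A_star.shape
k, low, high = 2, 1.0, 2.0                 # assumed sparsity and code range

# Coherence: largest off-diagonal entry of the Gram matrix in absolute value.
gram = A_star.T @ A_star
mu = np.max(np.abs(gram - np.eye(h)))

X_star = sample_sparse_codes(rng, h, k, low, high, num=1000)
Y = X_star @ A_star.T                      # each row is y = A* x*

# Place the bias halfway between the off-support upper bound and the
# on-support lower bound on the pre-activations (A*)^T y.
off_support_max = k * mu * high
on_support_min = low - (k - 1) * mu * high
eps = 0.5 * (off_support_max + on_support_min)

R = relu_layer(A_star, eps, Y)
support_recovered = np.array_equal(R > 0, X_star > 0)

# Tied-weight decoder (an assumption of this sketch), giving an R^n -> R^n map.
Y_hat = R @ A_star.T
rel_err = np.linalg.norm(Y_hat - Y) / np.linalg.norm(Y)

print("coherence:", mu)
print("support recovered on all samples:", support_recovered)
print("relative reconstruction error:", rel_err)
```

The reconstruction is not expected to be exact in this sketch, because the bias shrinks every active unit by $\epsilon$; the point is only that the set of active ReLU units coincides with the support of $x^{\ast}$ without any reference to a loss function.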
