Keywords: stochastic gradient descent, single-neuron networks, auto-encoders, implicit bias
TL;DR: We analyze the gradient dynamics of a single-neuron autoencoder trained on orthogonal data and show that the global minimum reached depends strongly on the batch size.
Abstract: We investigate the dynamics of (stochastic) gradient descent when training a single-neuron ReLU autoencoder on orthogonal inputs. We show that this non-convex problem has a manifold of global minima, all with the same maximum Hessian eigenvalue, and that gradient descent reaches a particular global minimum when initialized randomly. Interestingly, which minimum is reached depends heavily on the batch size. For full-batch gradient descent, the directions of the neuron that are initially positively correlated with the data are merely rescaled uniformly; hence, in high dimensions, the learned neuron is a near-uniform mixture of these directions. With batch size one, on the other hand, the neuron aligns exactly with a single such direction, showing that a small batch size induces a qualitatively different type of "feature selection".
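To make the setting concrete, the following is a minimal NumPy sketch of the two training regimes the abstract contrasts. It rests on assumptions that the abstract does not state: a tied-weight model x -> w * relu(w^T x), the standard basis vectors e_1, ..., e_d as the orthogonal inputs, a squared reconstruction loss, and hypothetical function names, learning rate, and step counts. Under these assumptions the per-sample gradient is w_i (||w||^2 - 2) e_i + w_i^2 w when w_i > 0 and zero otherwise, so a full-batch step multiplies every initially positive coordinate by the same factor.

```python
import numpy as np


def grad_single(w, i):
    """Gradient of 0.5 * ||e_i - w * relu(w_i)||^2 with respect to w.

    Assumed model (not taken from the abstract): tied weights and
    standard-basis inputs, so w^T e_i is simply the coordinate w[i].
    """
    if w[i] <= 0:  # ReLU inactive: this sample contributes no gradient
        return np.zeros_like(w)
    g = w[i] ** 2 * w               # term w_i^2 * w
    g[i] += w[i] * (w @ w - 2.0)    # term w_i * (||w||^2 - 2) * e_i
    return g


def run_gd(w0, lr=0.05, steps=3000):
    """Full-batch gradient descent on the summed loss over all d inputs."""
    w = w0.copy()
    d = w.size
    for _ in range(steps):
        w = w - lr * sum(grad_single(w, i) for i in range(d))
    return w


def run_sgd(w0, lr=0.05, steps=3000, seed=0):
    """Batch-size-one SGD: one uniformly sampled input per step."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    d = w.size
    for _ in range(steps):
        i = rng.integers(d)
        w = w - lr * grad_single(w, i)
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w0 = 0.1 * rng.standard_normal(20)  # small random initialization
    print("initially positive coordinates:", np.flatnonzero(w0 > 0))
    print("full batch   :", np.round(run_gd(w0), 3))
    print("batch size 1 :", np.round(run_sgd(w0), 3))
```

Running the script prints the neuron learned under each regime so the two can be compared side by side; how quickly the batch-size-one run concentrates on a single coordinate depends on the assumed learning rate and number of steps.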