How rotation invariant algorithms are fooled by noise on sparse targets
Abstract: It is well known that rotation invariant algorithms are sub-optimal for learning sparse linear problems,
when the number of examples is below the input dimension. This class includes any gradient-descent-trained
neural net whose fully-connected input layer is initialized with a rotationally symmetric
distribution. The simplest sparse problem is learning a single feature out of d features. In that case
the classification error or regression loss of rotation invariant algorithms scales with 1 − n/d, where n
is the number of examples seen. These lower bounds become vacuous when the number of examples
n reaches the dimension d. After d examples, the gradient space has full rank and any weight vector
can be expressed, including the unit vector that determines the target feature.
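A minimal simulation sketch of this setup (an illustration under assumptions, not code from the paper): the rotation invariant side is represented by the minimum-norm least-squares solution, which gradient descent on a linear model reaches from a rotationally symmetric initialization, and its test risk on the noiseless 1-sparse target tracks 1 − n/d:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 200, 100, 50                # fewer examples than dimensions (n < d)
target = np.zeros(d)
target[0] = 1.0                            # 1-sparse target: predict the first feature

risk = 0.0
for _ in range(trials):
    X = rng.standard_normal((n, d))
    y = X @ target                         # noiseless sparse linear labels
    w = np.linalg.pinv(X) @ y              # minimum-norm interpolator (lies in the row space of X)
    x_test = rng.standard_normal((2000, d))
    risk += np.mean((x_test @ w - x_test @ target) ** 2) / trials

print(f"rotation invariant risk ~ {risk:.3f}   vs   1 - n/d = {1 - n / d:.3f}")
```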
In this work, we show that when noise is added to this sparse linear problem, rotation invariant
algorithms are still sub-optimal after seeing d or more examples. We prove this via a lower bound
for the Bayes optimal algorithm on a rotationally symmetrized problem. We then prove much
smaller upper bounds on the same problem for a large variety of algorithms that are not rotation
invariant.
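A sketch of the noisy regime with n ≥ d, again under illustrative assumptions: ordinary least squares stands in for a rotation invariant algorithm, and Lasso, solved by plain proximal gradient (ISTA) with a textbook regularization level, stands in for an algorithm that is not rotation invariant; neither choice is claimed to be the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 200, 400, 1.0                 # more examples than dimensions, noisy labels
target = np.zeros(d)
target[0] = 1.0
X = rng.standard_normal((n, d))
y = X @ target + sigma * rng.standard_normal(n)

# Rotation invariant: ordinary least squares (the design has full column rank here).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Not rotation invariant: Lasso, minimized by proximal gradient (ISTA).
lam = sigma * np.sqrt(2 * np.log(d) / n)    # textbook regularization level (an assumption)
step = n / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the smooth part
w_lasso = np.zeros(d)
for _ in range(2000):
    g = X.T @ (X @ w_lasso - y) / n
    z = w_lasso - step * g
    w_lasso = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print("OLS   parameter error:", np.sum((w_ols - target) ** 2))
print("Lasso parameter error:", np.sum((w_lasso - target) ** 2))
```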
Finally, we analyze the gradient flow trajectories of many standard optimization algorithms
(such as AdaGrad) on the same noisy feature learning problem, and show how they veer away from
the noisy sparse targets. We then contrast them with a group of non-rotation-invariant algorithms
that veer towards the sparse targets.
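A toy trajectory comparison, again with illustrative stand-ins: plain gradient descent for the rotation invariant family, and gradient descent on the diagonal reparameterization w = u⊙u − v⊙v (a small-initialization "spindly" network known to bias trajectories toward sparse solutions) for the non-rotation-invariant family; both choices are assumptions for illustration, not necessarily the algorithms analyzed in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, sigma, lr, steps = 100, 50, 0.5, 0.01, 3000
target = np.zeros(d)
target[0] = 1.0
X = rng.standard_normal((n, d))
y = X @ target + sigma * rng.standard_normal(n)

def grad(w):                                # gradient of the mean squared error
    return X.T @ (X @ w - y) / n

w_gd = np.zeros(d)                          # plain gradient descent on w (rotation invariant)
u = np.full(d, 0.1)                         # reparameterized model w = u*u - v*v,
v = np.full(d, 0.1)                         # starting at w = 0 with small weights

for _ in range(steps):
    w_gd -= lr * grad(w_gd)
    g = grad(u * u - v * v)                 # chain rule through the reparameterization
    u, v = u - lr * 2 * g * u, v + lr * 2 * g * v

for name, w in [("gradient descent", w_gd), ("u*u - v*v reparam.", u * u - v * v)]:
    print(f"{name:20s} fraction of weight mass on the target: {w[0]**2 / np.sum(w**2):.2f}")
```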
We believe that our lower-bound method and trajectory categorization will be crucial for
analyzing other families of algorithms with different classes of invariances.
Submission Number: 59