Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization
Track: regular paper (up to 6 pages)
Keywords: high-dimensional statistics, empirical risk minimization, spurious correlations, linear regression, random features
TL;DR: We provide a quantitative characterization of how spurious correlations are learned in high-dimensional linear and random features models. We analyze the effects of regularization, simplicity of the spurious features and over-parameterization.
Abstract: Machine learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications for robustness, bias, and fairness.
In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive *core* feature $x$ and a *spurious* feature $y$. Specifically, we quantify the amount of spurious correlations $\mathcal C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $\mathcal C$ and the in-distribution test loss $\mathcal L$, by showing that the value of $\lambda$ that minimizes $\mathcal L$ lies in an interval where $\mathcal C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression.
Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.
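As a rough illustration of the setup described in the abstract, the following minimal sketch fits ridge regression on synthetic Gaussian data with a core feature $x$ and a correlated spurious feature $y$, then reports two quantities for several values of $\lambda$. Everything specific here is an assumption made for illustration and is not taken from the paper: the data model (labels linear in $x$ only, $y$ correlated with $x$ coordinate-wise with strength `rho`), the function names `ridge_fit` and `simulate`, and in particular the proxy used for $\mathcal C$ (the covariance between the label and the model output when the core feature is resampled independently), which may differ from the paper's definition.

```python
import numpy as np


def ridge_fit(Z, g, lam):
    """Ridge estimator: argmin_theta (1/n)||Z theta - g||^2 + lam ||theta||^2."""
    n, d = Z.shape
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(d), Z.T @ g / n)


def simulate(lam, n=400, d=30, rho=0.5, noise=0.5, seed=0):
    """Return (C_proxy, test_loss) for one synthetic Gaussian problem.

    Assumed data model (hypothetical, for illustration only):
      x ~ N(0, I_d)                               core feature
      y = rho * x + sqrt(1 - rho^2) * w           spurious feature, w ~ N(0, I_d)
      g = x @ beta + noise * eps                  label depends on x only
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)

    x = rng.standard_normal((n, d))
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal((n, d))
    g = x @ beta + noise * rng.standard_normal(n)

    theta = ridge_fit(np.hstack([x, y]), g, lam)
    theta_y = theta[d:]  # weights the model places on the spurious feature

    # Proxy for the spurious correlation: Cov(theta_y . y, g) when the core
    # input fed to the model is independent of y. Under the assumed data
    # model, Cov(y, g) = rho * beta.
    c_proxy = float(theta_y @ (rho * beta))

    # In-distribution test loss in closed form:
    # E[(z . theta - g)^2] = (theta - theta*)^T Sigma (theta - theta*) + noise^2,
    # with theta* = (beta, 0) and Sigma the covariance of z = (x, y).
    eye = np.eye(d)
    sigma = np.block([[eye, rho * eye], [rho * eye, eye]])
    diff = theta - np.concatenate([beta, np.zeros(d)])
    test_loss = float(diff @ sigma @ diff + noise**2)

    return c_proxy, test_loss


if __name__ == "__main__":
    for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
        c_proxy, test_loss = simulate(lam)
        print(f"lambda={lam:g}  C_proxy={c_proxy:.4f}  test_loss={test_loss:.4f}")
```

Sweeping `lam` in this way gives a quick empirical view of how the proxy and the test loss move with the regularization strength on one synthetic instance; it is not a substitute for the paper's theoretical characterization.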
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Presenter: ~Simone_Bombari1
Submission Number: 4