Keywords: robustness, domain adaptation, spurious correlation, dataset bias
Abstract: Machine learning has demonstrated remarkable prediction accuracy over i.i.d. data, but this accuracy often drops when the model is tested on data from another distribution. One reason behind this drop is the model's reliance on features that are associated with the label in the training distribution but not in the test distribution. This problem is usually referred to as spurious correlation, confounding factors, or dataset bias. In this paper, we formally study the generalization error bound for this setup, assuming knowledge of how the spurious features are associated with the label. We also compare our analysis to the widely accepted domain adaptation error bound and show that our bound can be tighter, albeit under additional assumptions that we consider realistic. Further, our analysis naturally suggests a set of solutions to this problem, connected to established solutions in various areas of robustness research; all of these solutions require some understanding of how the spurious features are associated with the label. Finally, we briefly discuss a method that does not require such an understanding.
One-sentence Summary: We offer a formal generalization error bound for learning in the presence of spuriously correlated features, given knowledge of these features; our bound also leads to a discussion of methods for this problem.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=vPUE2QzXj