Keywords: spurious correlation, pretrained language models
Abstract: Machine learning models are known to exploit spurious features: features that are predictive during training (e.g., the exclamation mark) but are not useful in general (e.g., the exclamation mark does not imply sentiment). Relying on such features may result in significant performance drops under distribution shifts. Recent work has found that Pretrained Language Models (PLMs) improve robustness against spurious features. However, existing evaluations of PLMs focus on only a small set of spurious features, painting a limited picture of the inductive biases of PLMs. In this work, we conduct a comprehensive empirical analysis comparing the generalization patterns of PLMs across diverse categories of spurious features, as a way to analyze their inductive biases. We find systematic patterns when finetuning BERT and few-shot prompting GPT-3: both exploit certain types of spurious features (e.g., content words) to a much larger extent than others (e.g., function words). Our findings indicate the settings in which pretraining alone can be expected to confer robustness and the kinds of spurious features for which other mitigation methods are necessary. We also study how different finetuning and prompting methods affect the robustness of PLMs to such features.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)