Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

Anonymous

16 Dec 2022 (modified: 05 May 2023) · ACL ARR 2022 December Blind Submission · Readers: Everyone
Abstract: While Large Language Models (LLMs) dominate most language understanding tasks, previous work shows that some of these results are achieved by modelling spurious correlations in the training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the biases of the training dataset. We propose a framework for measuring the extent of a model's reliance on any identified spurious feature, and quantify this reliance for several previously reported features while uncovering new ones. We assess robustness to a large set of known and newly found prediction biases for a variety of pre-trained models and state-of-the-art debiasing methods in Question Answering (QA), and compare the debiasing methods to a resampling baseline. We find that (i) the observed OOD gains of debiasing methods cannot be explained by mitigation or enlargement of the addressed bias, and subsequently show that (ii) the biases are widely shared among QA datasets. Our findings motivate future work to refine reports of LLMs' robustness to the level of specific spurious correlations.
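As a rough illustration of how reliance on one identified spurious feature could be measured (a minimal sketch, not the paper's actual framework; the `predict` and `is_bias_aligned` callables and the toy data are assumptions), one can compare a model's accuracy on evaluation examples where the spurious cue points to the gold answer against examples where it does not:

```python
# Hypothetical sketch: estimating a QA model's reliance on one identified
# spurious feature by comparing accuracy on bias-aligned vs. bias-conflicting
# evaluation examples. Helper names and toy data are illustrative assumptions,
# not the submission's actual framework or datasets.

from typing import Callable, Dict, List


def reliance_on_feature(
    examples: List[Dict],                      # each: {"question", "context", "answer"}
    predict: Callable[[Dict], str],            # model prediction function
    is_bias_aligned: Callable[[Dict], bool],   # does the spurious cue point to the gold answer?
) -> Dict[str, float]:
    """Return accuracy on bias-aligned and bias-conflicting subsets and their gap."""
    aligned, conflicting = [], []
    for ex in examples:
        correct = predict(ex).strip().lower() == ex["answer"].strip().lower()
        (aligned if is_bias_aligned(ex) else conflicting).append(correct)

    acc_aligned = sum(aligned) / max(len(aligned), 1)
    acc_conflicting = sum(conflicting) / max(len(conflicting), 1)
    return {
        "acc_bias_aligned": acc_aligned,
        "acc_bias_conflicting": acc_conflicting,
        # A large positive gap suggests the model exploits the spurious feature.
        "reliance_gap": acc_aligned - acc_conflicting,
    }


if __name__ == "__main__":
    # Toy usage with an "answer is the first token of the context" cue.
    examples = [
        {"question": "Who wrote Hamlet?", "context": "Shakespeare wrote Hamlet.", "answer": "Shakespeare"},
        {"question": "Who directed Jaws?", "context": "In 1975, Spielberg directed Jaws.", "answer": "Spielberg"},
    ]
    predict = lambda ex: ex["context"].split()[0].rstrip(",.")          # dummy "model"
    aligned = lambda ex: ex["context"].split()[0].rstrip(",.") == ex["answer"]
    print(reliance_on_feature(examples, predict, aligned))
```

In this framing, a model that scores much higher on bias-aligned than on bias-conflicting examples is plausibly relying on the spurious feature rather than the intended reasoning.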
Paper Type: long
Research Area: Question Answering
