Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension

ACL ARR 2024 June Submission 3743 Authors

16 Jun 2024 (modified: 02 Jul 2024) · CC BY 4.0
Abstract: As neural language models achieve human-comparable performance on Machine Reading Comprehension (MRC) and see widespread adoption, ensuring their robustness in real-world scenarios has become increasingly important. Current robustness evaluation research, however, primarily develops synthetic perturbation methods, and it remains unclear how well these reflect real-world conditions. To address this, we present a framework for automatically evaluating MRC models on naturally occurring textual perturbations, constructed by replacing paragraphs in MRC benchmarks with their counterparts from the publicly available Wikipedia edit history. This perturbation type is natural in that it does not stem from an artificial generative process, making it inherently distinct from previously investigated synthetic approaches. In a large-scale study spanning diverse model architectures, we observe that natural perturbations degrade the performance of pre-trained encoder language models, with errors extending to Flan-T5 and Large Language Models (LLMs). We further show that exposing encoder-only models to naturally perturbed examples during training helps them handle natural perturbations. This adversarial training approach, however, does not improve performance on the majority of synthetic perturbations, indicating that many types of synthetic noise do not actually occur in our collected real-world textual perturbations. We hope this study will encourage future robustness research to focus more on natural perturbations, deepening our understanding of how models respond to realistic linguistic challenges and informing practical robustness enhancement strategies.
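To make the framework concrete, the following is a minimal sketch of how one might pair a current Wikipedia paragraph with an earlier revision via the public MediaWiki API and swap it into an MRC example. This is an illustration of the idea only, not the authors' released code; the function names, the example fields (context, question, answers), and the choice of retrieving the two most recent revisions are all assumptions for demonstration.

```python
# Illustrative sketch (assumed names and fields, not the paper's actual code):
# fetch two adjacent revisions of a Wikipedia page through the standard
# MediaWiki API, then substitute the edited text as the MRC context.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_last_two_revisions(title: str) -> list[str]:
    """Return the wikitext of the two most recent revisions of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvslots": "main",
        "rvprop": "content",
        "rvlimit": 2,
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API, params=params, timeout=30).json()
    revisions = data["query"]["pages"][0]["revisions"]
    return [r["slots"]["main"]["content"] for r in revisions]

def perturb_example(example: dict, edited_context: str) -> dict:
    """Replace the MRC context with its naturally edited counterpart,
    leaving the question and answers untouched."""
    perturbed = dict(example)
    perturbed["context"] = edited_context
    return perturbed

if __name__ == "__main__":
    current, previous = fetch_last_two_revisions("Machine learning")
    example = {
        "context": current,
        "question": "What is machine learning?",
        "answers": [],  # placeholder; real examples come from an MRC benchmark
    }
    print(perturb_example(example, previous)["context"][:200])
```

In practice, a pipeline along these lines would also need to align the benchmark paragraph to the corresponding span of the retrieved revision and verify that the answer remains recoverable after the edit; those steps are omitted here for brevity.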
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, data shortcuts/artifacts, hardness of samples, human-subject application-grounded evaluations, probing, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3743