Robustness Evaluation in Natural Language Understanding: A Survey and Perspective in the Era of Large Language Models

ACL ARR 2025 July Submission 1343 Authors

29 Jul 2025 (modified: 19 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: As Large Language Models (LLMs) increasingly serve as the backbone of modern Question Answering (QA) systems, ensuring their robustness to input variation has become a critical concern. In this paper, we survey the trajectory of robustness evaluation for QA, with a particular focus on perturbation-based methods applied to textual input. We first review synthetic perturbation approaches developed for earlier neural models and discuss their continued relevance and adaptation to recent LLMs. We then examine natural perturbations, which originate from real-world language variation and provide a more realistic basis for evaluating robustness in practical scenarios. Based on our analysis, we identify key limitations in current robustness research and advocate for a shift toward evaluation methodologies that emphasize natural linguistic variability. We also outline future research directions, including the need for systematic evaluation protocols, a deeper understanding of robustness in the context of LLM-based QA, and explicit consideration of benchmark leakage when evaluating the robustness of LLMs.
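For context, the perturbation-based evaluation the abstract refers to is typically implemented as a simple transformation of the textual input before it is passed to the QA system. The following is a minimal sketch, not taken from the paper, of a synthetic character-level (typo-style) perturbation applied to a question; the function name, parameters, and probability value are chosen here purely for illustration, and real studies use many other variants (paraphrases, word substitutions, distractor sentences, etc.).

```python
import random


def perturb_question(question: str, swap_prob: float = 0.1, seed: int = 0) -> str:
    """Synthetic perturbation: randomly swap adjacent characters inside words.

    This mimics the typo-style noise commonly used in perturbation-based
    robustness evaluation of QA inputs.
    """
    rng = random.Random(seed)
    chars = list(question)
    i = 0
    while i < len(chars) - 1:
        # Only swap alphabetic neighbours (i.e., within words), with probability `swap_prob`.
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    original = "Who wrote the novel Pride and Prejudice?"
    print(perturb_question(original, swap_prob=0.2))
```

Robustness is then typically measured by comparing the model's answers on the original and perturbed questions; natural perturbations, by contrast, are drawn from real-world language variation rather than generated by a rule like the one above.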
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; evaluation methodologies; evaluation
Contribution Types: Surveys
Languages Studied: English
Submission Number: 1343