Rethinking Robustness Evaluation for Question Answering: From Synthetic Stress Tests to Natural Language Variation

ACL ARR 2026 January Submission 5065 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: adversarial attacks/examples/training, probing, robustness
Abstract: In this position paper, we contend that prevailing robustness evaluation practices for Question Answering (QA) do not adequately capture system behavior under real-world conditions. Current evaluations predominantly rely on synthetic perturbations built on idealized assumptions about linguistic validity and label preservation, and the relevance of these perturbations to deployment scenarios is often unclear. Consequently, robustness assessed in such settings may provide a distorted view of the reliability of QA systems. This limitation becomes increasingly salient as Large Language Models (LLMs) are deployed in interactive and agent-based applications, where language variation emerges organically and compounds across multiple interactions. We examine commonly adopted synthetic perturbation paradigms, analyze their limitations, and contrast them with emerging efforts that evaluate robustness using naturally occurring perturbations. Building on this analysis, we advocate for a community-wide shift toward robustness evaluation grounded in real-world language variation and more reliable evaluation protocols.
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, probing, robustness
Contribution Types: Position papers
Languages Studied: English
Submission Number: 5065