Evaluating Linguistic Robustness of Large Language Models for Question Answering: A Study on Consumer Health Queries

ACL ARR 2025 May Submission 6405 Authors

20 May 2025 (modified: 03 Jul 2025) | ACL ARR 2025 May Submission | License: CC BY 4.0
Abstract: Question-Answering (QA) systems powered by Large Language Models (LLMs) increasingly enable interactive access to essential information across diverse domains. However, the robustness of these systems to variations in linguistic style, such as differences in reading level, formality, or domain-specific terminology, remains underexplored. To address this gap, we propose the Style Perturbed Question Answering (SPQA) framework, which systematically perturbs original questions to produce linguistically diverse variants and evaluates model responses to both the original and perturbed queries on correctness, completeness, coherence, and linguistic adaptability. Given the critical importance of accessible and medically accurate health information, we apply SPQA to consumer health QA. Using a scalable evaluation pipeline that combines automated style-transfer methods with a rigorously validated GPT-4o-based automated evaluator, we benchmark several state-of-the-art LLMs. Our results show substantial performance declines under realistic stylistic perturbations, highlighting significant challenges to equity, reliability, and robustness in consumer-facing QA systems, especially in sensitive domains such as healthcare.
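To make the pipeline concrete, the sketch below shows what an SPQA-style perturb-and-judge loop could look like. Only the use of GPT-4o as the evaluation model is taken from the abstract; the style list, prompts, scoring rubric, and helper names (perturb_question, judge_answer, evaluate) are hypothetical illustrations, not the authors' released code.

```python
# Hypothetical sketch of an SPQA-style evaluation loop (illustrative only).
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

# Example stylistic axes; the paper's actual perturbation set may differ.
STYLES = ["low reading level", "high formality", "heavy medical jargon"]

def perturb_question(question: str, style: str) -> str:
    """Rewrite `question` in the target style while preserving its meaning."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Rewrite this consumer health question in the style "
                              f"'{style}', preserving its meaning:\n{question}"}],
    )
    return resp.choices[0].message.content

def judge_answer(question: str, answer: str) -> dict:
    """Score an answer on the four SPQA axes with a GPT-4o judge (illustrative rubric)."""
    rubric = ("Rate the answer to the question on a 1-5 scale for each of: "
              "correctness, completeness, coherence, linguistic adaptability. "
              "Reply as JSON with exactly those four keys.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": f"{rubric}\n\nQuestion: {question}\nAnswer: {answer}"}],
    )
    return json.loads(resp.choices[0].message.content)

def evaluate(question: str, answer_fn) -> dict:
    """Compare judge scores on the original question vs. its styled variants.

    `answer_fn` is the QA system under test: a callable mapping a question
    string to an answer string.
    """
    scores = {"original": judge_answer(question, answer_fn(question))}
    for style in STYLES:
        variant = perturb_question(question, style)
        scores[style] = judge_answer(variant, answer_fn(variant))
    return scores
```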
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: linguistic variation, style analysis, style generation, conversational QA, healthcare applications, robustness
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 6405