Natural Context Drift Undermines the Natural Language Understanding of Large Language Models

ACL ARR 2025 May Submission 7477 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: How does the natural evolution of context paragraphs affect question answering in generative Large Language Models (LLMs)? To investigate this, we propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analyzing LLM performance across a range of semantic similarity scores, which measure how closely each variant aligns with content seen during pretraining. Using this framework, we evaluate six QA datasets and eight LLMs with publicly available training data. Our experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining, even when the question and all necessary information remain present at inference time. For instance, average model accuracy on BOOLQ drops by over 25% from the highest to the lowest similarity bin, with accuracy-versus-similarity slopes exceeding 70 across several LLMs. These results suggest that natural text evolution poses a significant challenge to the language understanding capabilities of LLMs.
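As a rough illustration of the similarity-binning analysis the abstract describes, the sketch below (not from the paper; the embedding model, the `records` structure, and the bin count are all hypothetical choices) scores each evolved passage against its original via embedding cosine similarity, groups examples into similarity bins, and fits a linear slope of per-bin accuracy against similarity.

```python
# Minimal sketch of a similarity-binning analysis, assuming:
# - sentence-transformers embeddings stand in for the paper's similarity measure
# - `records` pairs each evolved passage with its original and a 0/1
#   model-correctness flag (toy data; real runs would use full benchmarks)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedding choice

records = [
    # (original_passage, evolved_passage, model_answered_correctly)
    ("The aurora is caused by solar wind ...", "Auroras arise when solar wind ...", 1),
    ("The aurora is caused by solar wind ...", "Charged particles from the sun ...", 0),
]

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

orig_emb = model.encode([r[0] for r in records])
var_emb = model.encode([r[1] for r in records])
sims = np.array([cosine(o, v) for o, v in zip(orig_emb, var_emb)])
correct = np.array([r[2] for r in records], dtype=float)

# Group examples into equal-width similarity bins and compute per-bin accuracy.
n_bins = 5
edges = np.linspace(sims.min(), sims.max() + 1e-9, n_bins + 1)
bin_ids = np.digitize(sims, edges) - 1
centers, accs = [], []
for b in range(n_bins):
    mask = bin_ids == b
    if mask.any():
        centers.append((edges[b] + edges[b + 1]) / 2)
        accs.append(correct[mask].mean())

# Slope of accuracy (in percentage points) against similarity: a large positive
# slope means accuracy falls steeply as passages drift from their pretraining form.
if len(centers) >= 2:
    slope = np.polyfit(centers, np.array(accs) * 100, 1)[0]
    print(f"accuracy-vs-similarity slope: {slope:.1f}")
```

Under this reading, a slope above 70 means accuracy swings by more than 70 percentage points over the unit similarity range, which is one plausible interpretation of the "slopes exceeding 70" figure in the abstract.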
Paper Type: Short
Research Area: Question Answering
Research Area Keywords: reading comprehension, generalization
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7477