Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
Abstract: In this short paper, we propose a “Generalization Stress Test” to assess the generalization ability of Large Language Models (LLMs) under slight, controlled perturbations, including changes to option length, problem type, and irrelevant noun replacements.
We find that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., a preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when only the option lengths are changed, without altering the question. Even GPT-4o loses 25 points of accuracy when problem types are changed, and shows a 6-point drop across all three modification categories.
These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts.
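To make the perturbation idea concrete, the sketch below illustrates one possible option-length modification of the kind the stress test applies: distractor options are padded with semantically irrelevant filler while the question and the correct answer are left untouched. This is an illustrative assumption only; the class, helper names, and filler text are hypothetical and not the paper's released implementation.

```python
# Minimal sketch of a content-preserving option-length perturbation
# for a multiple-choice item. Hypothetical helper; not the paper's code.

from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MCItem:
    question: str
    options: List[str]   # answer options, e.g. ["Venus", "Mars", ...]
    answer_idx: int      # index of the correct option

# Illustrative filler that adds length but no information.
FILLER = " (this option is stated at greater length without adding information)"


def lengthen_distractors(item: MCItem) -> MCItem:
    """Return a copy of the item whose distractors are padded with
    irrelevant filler, leaving the question and correct option unchanged."""
    new_options = [
        opt if i == item.answer_idx else opt + FILLER
        for i, opt in enumerate(item.options)
    ]
    return replace(item, options=new_options)


# Example: the correct option stays short while distractors grow longer,
# probing whether a model's choice shifts with superficial option length.
item = MCItem(
    question="Which planet is known as the Red Planet?",
    options=["Venus", "Mars", "Jupiter", "Saturn"],
    answer_idx=1,
)
print(lengthen_distractors(item).options)
```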
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: Language Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6907