Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
Abstract: In this short paper, we propose a “Generalization Stress Test” to assess the generalization ability of Large Language Models (LLMs) under slight, controlled perturbations, including changes to option length, problem type, and irrelevant noun replacements.
We find that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., a preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when only the option lengths are changed, without altering the question. Even GPT-4o loses 25 points of accuracy when problem types are changed, and shows a 6-point drop across all three modification categories.
These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts.
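To make the perturbation idea concrete, the sketch below illustrates one possible option-length modification of the kind the stress test applies: distractor options are padded with semantically irrelevant filler while the question and the correct answer are left untouched. This is an illustrative assumption only; the class, helper names, and filler text are hypothetical and not the paper's released implementation.

```python
# Minimal sketch of a content-preserving option-length perturbation
# for a multiple-choice item. Hypothetical helper; not the paper's code.

from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MCItem:
    question: str
    options: List[str]   # answer options, e.g. ["Venus", "Mars", ...]
    answer_idx: int      # index of the correct option

# Illustrative filler that adds length but no information.
FILLER = " (this option is stated at greater length without adding information)"


def lengthen_distractors(item: MCItem) -> MCItem:
    """Return a copy of the item whose distractors are padded with
    irrelevant filler, leaving the question and correct option unchanged."""
    new_options = [
        opt if i == item.answer_idx else opt + FILLER
        for i, opt in enumerate(item.options)
    ]
    return replace(item, options=new_options)


# Example: the correct option stays short while distractors grow longer,
# probing whether a model's choice shifts with superficial option length.
item = MCItem(
    question="Which planet is known as the Red Planet?",
    options=["Venus", "Mars", "Jupiter", "Saturn"],
    answer_idx=1,
)
print(lengthen_distractors(item).options)
```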
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: Language Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6907