Keywords: large language models, benchmark evaluation, robustness, adversarial perturbations, selective perturbations, model comparison, MMLU, GPQA
TL;DR: Selective perturbations reveal model-specific benchmark sensitivity that is hidden by aggregate accuracy scores.
Abstract: Benchmark accuracy is a useful summary of model performance, but it does not show how sensitive a model comparison is to question wording. We study this sensitivity with selective perturbations: small edits to multiple-choice questions that change the answer of one target model while preserving other models' answers. We implement this idea with a reference-preserving search constraint and evaluate the resulting perturbations on both reference models used during search and unseen models held out from the search. On the full MMLU dev split, unconstrained perturbations often degrade several models at once. With the selectivity constraint, a large target-specific component remains: across Gemma-3-12B, Llama-3.1-8B, and Qwen3.5-9B, target accuracy drops by 0.38 to 0.44, while reference drops remain at most 0.04 and unseen-model drops at most 0.10. Smaller supporting experiments on GPQA Diamond, within the Gemma family, with Gemini-2.5-Flash as target, and with selective improvement show the same qualitative pattern. Manual inspection suggests that the target-specific component is structured: Qwen3.5-9B is more often affected by coarse substitutions that corrupt domain anchors, while Gemma-3-12B is affected by milder edits such as near-synonyms, register shifts, and casing changes. These results suggest that aggregate benchmark scores can hide not only how often models fail, but also which local changes expose their failures.
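The selectivity criterion described in the abstract can be illustrated with a minimal sketch. This is not the authors' search procedure; it only shows one plausible way to check whether a single perturbed question is "selective" and to compute an accuracy drop. The function names (`is_selective`, `accuracy_drop`), the model labels, and the dictionary layout are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Sequence


def is_selective(
    before: Dict[str, str],
    after: Dict[str, str],
    target: str,
    references: List[str],
) -> bool:
    """Return True if a perturbation flips the target model's answer
    while every reference model keeps its original answer.

    `before` and `after` map a (hypothetical) model name to the answer
    letter it chose on the original and perturbed question.
    """
    target_flipped = before[target] != after[target]
    references_preserved = all(before[m] == after[m] for m in references)
    return target_flipped and references_preserved


def accuracy_drop(
    gold: Sequence[str],
    preds_before: Sequence[str],
    preds_after: Sequence[str],
) -> float:
    """Accuracy on the original questions minus accuracy on the perturbed ones."""
    n = len(gold)
    acc_before = sum(p == g for p, g in zip(preds_before, gold)) / n
    acc_after = sum(p == g for p, g in zip(preds_after, gold)) / n
    return acc_before - acc_after


if __name__ == "__main__":
    # Toy example with made-up model labels and answers.
    before = {"target": "A", "ref1": "A", "ref2": "B"}
    after = {"target": "C", "ref1": "A", "ref2": "B"}
    print(is_selective(before, after, "target", ["ref1", "ref2"]))
```

Under this reading, an unconstrained perturbation would only require the target's answer to flip, whereas the selective variant additionally rejects any edit that changes a reference model's answer.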
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 166