Keywords: Robustness, Evaluation
Abstract: Robustness is often regarded as a critical future challenge for real-world applications, where stability is essential. In this work, we question this assumption and explore the relationship between robustness and performance, hypothesizing that high performance on a task is a strong indicator of robustness. Through an empirical analysis of multiple models across diverse datasets and configurations (e.g., paraphrases, different temperatures), we find a strong positive correlation: as models approach high performance on a task, robustness is effectively achieved. This effect persists beyond the "trivial robustness" expected from high success rates and holds across architectures. Our findings suggest that robustness is primarily driven by task-specific competence rather than inherent model-level properties, challenging current approaches that treat robustness as an independent capability. Thus, from a high-level perspective, we may expect that as new tasks saturate, model robustness on those tasks will emerge accordingly. This calls for a reduced focus on measuring and improving robustness, as it is likely to resolve naturally with performance gains.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Robustness, Saturation, Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 9456