Keywords: OOD Generalization Benchmarks
Abstract: Benchmarks for out-of-distribution (OOD) generalization often reveal a strong positive correlation between in-distribution (ID) and OOD accuracy across models, a phenomenon known as “accuracy-on-the-line.” This pattern is commonly interpreted as evidence that spurious correlations—relationships that improve ID but harm OOD performance—are rare in practice. We show that this positive correlation can be an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line breaks down. Across widely used distribution-shift benchmarks, OODSelect uncovers subsets—sometimes comprising more than half of the standard OOD set—where higher ID accuracy predicts lower OOD accuracy. These results suggest that aggregate metrics can mask critical failure modes in OOD robustness. We release code and the identified subsets to support further research.
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 6145
Loading