Keywords: shortcut learning, spurious correlations, gradient sensitivity, dataset scaling, optimizer implicit bias, feature sensitivity, distribution shift, robustness, neural networks, mathematical analysis, synthetic data, controlled experiments, generalization failure, feature dependence
TL;DR: Data scaling amplifies shortcut reliance: gradient sensitivity to a spurious feature rises with training size despite near-perfect accuracy. Amplification depends on optimizer and shortcut strength.
Abstract: Deep neural networks are known to exploit non-causal correlations that fail under distribution shift, yet how shortcut reliance evolves as the training set grows remains unclear. We identify a scaling-induced shortcut amplification phenomenon in a controlled binary classification setting with two features: an invariant causal feature that fully determines the label, and a non-causal feature that is correlated with the label during training but decorrelated at test time. To quantify functional shortcut dependence directly, we introduce a gradient-based sensitivity metric, defined as the mean absolute derivative of the model logit with respect to the spurious coordinate and evaluated on decorrelated test data. This metric reveals latent shortcut reliance even when predictive accuracy is near saturation. We find that increasing the number of training samples systematically amplifies gradient sensitivity to the spurious feature despite negligible changes in test accuracy, indicating that scaling can simultaneously improve performance and reinforce non-causal feature dependence. Furthermore, this amplification is strongly modulated by optimization dynamics: adaptive methods substantially suppress spurious gradient growth relative to stochastic gradient descent. Finally, varying the shortcut correlation strength reveals a structured scaling boundary governing the onset of substantial shortcut reliance, consistent with an empirical power-law relationship. Together, these results show that shortcut amplification emerges from a joint interaction between data scale, correlation strength, and optimization bias.
Submission Number: 91
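The sensitivity metric described in the abstract (mean absolute derivative of the logit with respect to the spurious coordinate, evaluated on decorrelated test data) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the submission: the data generator `make_dataset`, its noise scales, the hand-written `linear_logit` standing in for a trained network, and the use of central finite differences in place of automatic differentiation.

```python
import random

def make_dataset(n, rho, rng):
    """Hypothetical generator: the causal feature's sign fully determines the
    label; the spurious feature agrees with the label with probability rho."""
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        x_causal = y * (1.0 + abs(rng.gauss(0.0, 1.0)))
        agree = rng.random() < rho
        x_spur = (y if agree else -y) + rng.gauss(0.0, 0.1)
        data.append(((x_causal, x_spur), y))
    return data

def spurious_sensitivity(logit_fn, inputs, eps=1e-4):
    """Mean absolute derivative of the logit w.r.t. the spurious coordinate,
    approximated with central finite differences."""
    total = 0.0
    for x_c, x_s in inputs:
        grad = (logit_fn(x_c, x_s + eps) - logit_fn(x_c, x_s - eps)) / (2 * eps)
        total += abs(grad)
    return total / len(inputs)

def linear_logit(x_c, x_s):
    """Stand-in model (assumed): weight 2.0 on causal, 0.5 on spurious."""
    return 2.0 * x_c + 0.5 * x_s

rng = random.Random(0)
# rho = 0.5 makes the spurious feature uninformative, i.e. decorrelated test data.
test_set = make_dataset(1000, rho=0.5, rng=rng)
test_inputs = [x for x, _ in test_set]

sensitivity = spurious_sensitivity(linear_logit, test_inputs)
accuracy = sum((linear_logit(*x) > 0) == (y > 0) for x, y in test_set) / len(test_set)
print(f"spurious sensitivity: {sensitivity:.4f}, test accuracy: {accuracy:.3f}")
```

For a linear logit the metric reduces to the absolute weight on the spurious coordinate, so this stand-in model has nonzero sensitivity while still classifying the decorrelated test set almost perfectly. That is the kind of latent reliance the abstract says accuracy alone does not reveal.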