Keywords: shortcut learning, spurious correlations, dataset scaling, gradient sensitivity, distribution shift, out-of-distribution generalization, synthetic binary classification, invariant causal feature, spurious feature reinforcement, optimizer implicit bias, SGD, Adam, AdamW, robustness diagnostics, critical onset scaling, beta-scaling phase boundary
TL;DR: Scaling training data can increase neural networks’ reliance on spurious shortcut features even when accuracy stays high, while Adam/AdamW reduce this shortcut amplification compared to SGD.
Abstract: Deep neural networks often exploit spurious shortcuts, i.e., non-causal correlations that fail under distribution shift. In a controlled synthetic binary classification setting with one invariant causal feature and one label-correlated shortcut feature, we study how shortcut reliance evolves as the dataset scales. Using gradient sensitivity to the spurious dimension as a direct functional diagnostic, we show a scaling-induced amplification effect: as training set size increases, models become increasingly sensitive to the shortcut feature despite near-saturated test accuracy. We further find that optimizer choice modulates this reinforcement, with Adam and AdamW substantially suppressing spurious gradient growth relative to stochastic gradient descent (SGD).
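Illustrative sketch (not the authors' code): the abstract does not specify the data generator, model, or exact sensitivity definition, so everything below is an assumption made for illustration. It swaps the neural network for a plain logistic regression trained with full-batch gradient descent, uses a hypothetical two-feature generator (`make_data`, with made-up `spurious_agreement` and `noise` parameters), and takes "gradient sensitivity to the spurious dimension" to mean the mean absolute derivative of the predicted probability with respect to the shortcut input. It is meant only to show how such a diagnostic could be computed, not to reproduce any reported result.

```python
import numpy as np

def make_data(n, spurious_agreement=0.9, noise=1.0, seed=0):
    """Hypothetical generator: column 0 is the causal feature,
    column 1 is a shortcut that agrees with the label with
    probability `spurious_agreement` (an assumed parameter)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    sign = 2.0 * y - 1.0                      # labels mapped to {-1, +1}
    causal = sign + noise * rng.standard_normal(n)
    agree = rng.random(n) < spurious_agreement
    spurious = np.where(agree, sign, -sign) + 0.1 * rng.standard_normal(n)
    return np.stack([causal, spurious], axis=1), y.astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(x, y, lr=0.1, steps=500):
    """Full-batch gradient descent on the logistic loss
    (a stand-in for the optimizers named in the abstract)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        w -= lr * (x.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def spurious_sensitivity(w, b, x, dim=1):
    """Mean |d p / d x_dim| over the inputs. For a logistic model,
    d p / d x_j = p * (1 - p) * w_j."""
    p = sigmoid(x @ w + b)
    return float(np.mean(np.abs(p * (1.0 - p) * w[dim])))

if __name__ == "__main__":
    # Evaluate the diagnostic at a few (arbitrary) training set sizes.
    for n in (100, 1000, 10000):
        x, y = make_data(n)
        w, b = train_logreg(x, y)
        print(n, spurious_sensitivity(w, b, x))
```

For a neural network, the closed-form derivative in `spurious_sensitivity` would be replaced by an automatic-differentiation call on the model output with respect to the input; the averaging step stays the same.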
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 58