Abstract: Unlike carefully curated academic benchmarks, real-world datasets are often highly class-imbalanced, especially in safety-critical scenarios. Through extensive empirical investigation, we study foundational learning behaviors under class imbalance for a variety of models, including neural networks, gradient-boosted decision trees, and SVMs, across a range of domains. Motivated by our observation that re-balancing class-imbalanced training data is ineffective, we show that several simple techniques for improving representation learning are effective in this setting: (1) self-supervised pre-training is insensitive to imbalance and can be used for feature learning before fine-tuning on labels; (2) Bayesian inference is effective because neural networks are especially underspecified under class imbalance; (3) flatness-seeking regularization pulls decision boundaries away from minority samples, especially when we seek minima that are particularly flat on the minority samples' loss.
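
As a concrete illustration of point (3), below is a minimal sketch of a SAM-style (sharpness-aware minimization) training step with the loss re-weighted toward the minority class, so that flatness is sought primarily on the minority samples' loss. The toy data, class weights, and perturbation radius rho are illustrative assumptions, not the paper's exact method or hyperparameters.

# Hedged sketch: SAM-style update with the loss up-weighted on the
# minority class so the flatness penalty concentrates on minority-sample
# loss. Data, weights, and rho are assumed for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy imbalanced binary data: roughly 95% class 0, 5% class 1 (assumed).
X = torch.randn(400, 10)
y = (torch.rand(400) < 0.05).long()

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Up-weight the minority class so the sharpness term emphasizes its loss.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 19.0]))
rho = 0.05  # SAM perturbation radius (assumed value)

for step in range(100):
    # 1) Gradient of the weighted loss at the current weights.
    loss = criterion(model(X), y)
    opt.zero_grad()
    loss.backward()

    # 2) Ascend to nearby "worst-case" weights: w + rho * g / ||g||.
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    eps = [rho * g / (norm + 1e-12) for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)

    # 3) Gradient at the perturbed weights; undo the perturbation, then
    #    descend using that gradient, which penalizes sharp minima.
    loss_perturbed = criterion(model(X), y)
    opt.zero_grad()
    loss_perturbed.backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    opt.step()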