Keywords: Imbalanced classification, overfitting, margin, logistic regression, support vector machine, overparametrization, calibration
TL;DR: Overfitting in high-dimensional imbalanced classification arises from truncation/skewing effects on the logit distribution.
Abstract: Classification with imbalanced data is a common challenge in machine learning, where minority classes form only a small fraction of the training samples. Classical theory, relying on large-sample asymptotics and finite-sample corrections, is often ineffective in high dimensions, leaving many overfitting phenomena unexplained. In this paper, we develop a statistical theory for high-dimensional imbalanced linear classification, showing that dimensionality induces truncation or skewing effects on the logit distribution, which we characterize via a variational problem. For linearly separable Gaussian mixtures, logits follow $\mathsf{N}(0,1)$ on the test set but converge to $\max\{\kappa, \mathsf{N}(0,1)\}$ on the training set, a pervasive phenomenon we confirm on tabular, image, and text data.
This phenomenon explains why the minority class is more severely affected by overfitting. We further show that margin rebalancing mitigates the drop in minority-class accuracy, and we provide theoretical insights into calibration and uncertainty quantification.
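The truncation claim is easy to probe numerically. The sketch below is not the paper's code; it is a minimal simulation under illustrative assumptions (balanced classes, dimension-to-sample ratio $d/n = 5$, a weak class signal, and a large-$C$ linear SVM as a stand-in for the max-margin classifier). Under the abstract's claim, the signed logits along the unit-norm separating direction should look roughly $\mathsf{N}(0,1)$ on fresh test points but truncated below at about the empirical margin $\kappa$ on the training set.

```python
# Minimal simulation sketch (not the paper's code) of the claimed truncation effect.
# Sample sizes, dimension, and signal strength below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 300, 1500                      # proportional high-dimensional regime (d/n = 5)
mu = np.zeros(d)
mu[0] = 0.5                           # weak class signal, so noise dominates the learned direction

def sample(m):
    """Draw m points from the two-component Gaussian mixture x = y*mu + N(0, I_d)."""
    y = rng.choice([-1.0, 1.0], size=m)
    X = y[:, None] * mu + rng.standard_normal((m, d))
    return X, y

X_tr, y_tr = sample(n)
X_te, y_te = sample(5000)

# A very large C makes the soft-margin linear SVM approximate the hard-margin (max-margin) classifier.
clf = SVC(kernel="linear", C=1e6).fit(X_tr, y_tr)
coef, b = clf.coef_.ravel(), clf.intercept_[0]
scale = np.linalg.norm(coef)

# Signed logits along the unit-norm separating direction.
logit_tr = y_tr * (X_tr @ coef + b) / scale
logit_te = y_te * (X_te @ coef + b) / scale
kappa = logit_tr.min()                # empirical hard margin

q = [0.05, 0.25, 0.5, 0.75, 0.95]
print("kappa (min training margin):", round(kappa, 3))
print("train logit quantiles:", np.round(np.quantile(logit_tr, q), 3))  # truncated below at ~kappa
print("test  logit quantiles:", np.round(np.quantile(logit_te, q), 3))  # roughly N(0,1) quantiles
```

Extending the sketch to an imbalanced mixture (e.g. drawing the labels with unequal probabilities) would show the minority-class training logits concentrating more heavily at the margin, which is the mechanism the abstract points to for the minority accuracy drop.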
Primary Area: learning theory
Submission Number: 23187