Keywords: Imbalanced classification, overfitting, margin, logistic regression, support vector machine, overparametrization, calibration
TL;DR: Overfitting in high-dimensional imbalanced classification arises from truncation/skewing effects on the logit distribution.
Abstract: Classification with imbalanced data is a common challenge in machine learning, where minority classes form only a small fraction of the training samples. Classical theory, relying on large-sample asymptotics and finite-sample corrections, is often ineffective in high dimensions, leaving many overfitting phenomena unexplained. In this paper, we develop a statistical theory for high-dimensional imbalanced linear classification, showing that dimensionality induces truncation or skewing effects on the logit distribution, which we characterize via a variational problem. For linearly separable Gaussian mixtures, logits follow $\mathsf{N}(0,1)$ on the test set but converge to $\max\{\kappa, \mathsf{N}(0,1)\}$ on the training set, a pervasive phenomenon we confirm on tabular, image, and text data.
This phenomenon explains why the minority class is more severely affected by overfitting. We further show that margin rebalancing mitigates the drop in minority-class accuracy, and we provide theoretical insights into calibration and uncertainty quantification.
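The truncation claim is easy to probe numerically. The sketch below is not the paper's code; it is a minimal simulation under illustrative assumptions (balanced classes, dimension-to-sample ratio $d/n = 5$, a weak class signal, and a large-$C$ linear SVM as a stand-in for the max-margin classifier). Under the abstract's claim, the signed logits along the unit-norm separating direction should look roughly $\mathsf{N}(0,1)$ on fresh test points but truncated below at about the empirical margin $\kappa$ on the training set.

```python
# Minimal simulation sketch (not the paper's code) of the claimed truncation effect.
# Sample sizes, dimension, and signal strength below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 300, 1500                      # proportional high-dimensional regime (d/n = 5)
mu = np.zeros(d)
mu[0] = 0.5                           # weak class signal, so noise dominates the learned direction

def sample(m):
    """Draw m points from the two-component Gaussian mixture x = y*mu + N(0, I_d)."""
    y = rng.choice([-1.0, 1.0], size=m)
    X = y[:, None] * mu + rng.standard_normal((m, d))
    return X, y

X_tr, y_tr = sample(n)
X_te, y_te = sample(5000)

# A very large C makes the soft-margin linear SVM approximate the hard-margin (max-margin) classifier.
clf = SVC(kernel="linear", C=1e6).fit(X_tr, y_tr)
coef, b = clf.coef_.ravel(), clf.intercept_[0]
scale = np.linalg.norm(coef)

# Signed logits along the unit-norm separating direction.
logit_tr = y_tr * (X_tr @ coef + b) / scale
logit_te = y_te * (X_te @ coef + b) / scale
kappa = logit_tr.min()                # empirical hard margin

q = [0.05, 0.25, 0.5, 0.75, 0.95]
print("kappa (min training margin):", round(kappa, 3))
print("train logit quantiles:", np.round(np.quantile(logit_tr, q), 3))  # truncated below at ~kappa
print("test  logit quantiles:", np.round(np.quantile(logit_te, q), 3))  # roughly N(0,1) quantiles
```

Extending the sketch to an imbalanced mixture (e.g. drawing the labels with unequal probabilities) would show the minority-class training logits concentrating more heavily at the margin, which is the mechanism the abstract points to for the minority accuracy drop.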
Primary Area: learning theory
Submission Number: 23187