Abstract: Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes-consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (*Imbalanced Margin Maximization*), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.
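As a rough, illustrative sketch of the confidence-margin idea behind IMMAX (not the paper's exact objective, whose precise form and guarantees are given in the paper), the snippet below shows one common way to encode class-dependent margins: a cross-entropy loss in which rarer classes receive larger margins that the true-class score must exceed. The margin schedule (inverse fourth root of class counts) and the `max_margin` parameter are assumptions introduced here purely for illustration.

```python
# Hypothetical sketch of a class-dependent confidence-margin loss
# (not the paper's exact IMMAX objective). Rarer classes get larger margins.
import torch
import torch.nn.functional as F


class ClassDependentMarginLoss(torch.nn.Module):
    def __init__(self, class_counts, max_margin=0.5):
        super().__init__()
        counts = torch.as_tensor(class_counts, dtype=torch.float32)
        # Assumption: margins inversely proportional to the fourth root of the
        # class counts, rescaled so the rarest class gets `max_margin`.
        margins = 1.0 / counts.sqrt().sqrt()
        margins = margins * (max_margin / margins.max())
        self.register_buffer("margins", margins)

    def forward(self, logits, targets):
        # Subtract each example's class margin from its true-class logit, so
        # the correct score must beat the other scores by at least that margin.
        adjustment = torch.zeros_like(logits)
        adjustment.scatter_(1, targets.unsqueeze(1),
                            self.margins[targets].unsqueeze(1))
        return F.cross_entropy(logits - adjustment, targets)


# Example: a long-tailed three-class problem with counts 1000, 100, and 10.
loss_fn = ClassDependentMarginLoss(class_counts=[1000, 100, 10])
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
print(loss_fn(logits, targets).item())
```

In this sketch, subtracting a per-class margin from the true-class logit before the softmax cross-entropy forces minority-class predictions to win by a larger gap, which is one simple way to realize class-dependent confidence margins; the paper's actual loss and its $H$-consistency analysis are developed in the main text.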
Lay Summary: Imagine you're training a learning algorithm to identify different types of animals in photos, but your dataset has 1,000 pictures of cats for every one picture of a rare leopard. A learning algorithm trained on this data will become an expert at spotting cats, but it will likely fail to recognize the leopard, simply because it's so rare. This "class imbalance" problem is a major challenge in machine learning, appearing in fields from medical diagnosis (rare diseases) to fraud detection (rare fraudulent activities). When the stakes are high, failing to identify the rare case can have serious consequences.
Many current techniques try to solve this by either duplicating the rare data or telling the learning algorithm to pay extra attention to it. While these methods can sometimes help, they are more like patches than real solutions. They lack strong theoretical foundations, meaning we don't fully understand why they work or when they might fail. In fact, we show that some of these popular methods can be fundamentally flawed and may not lead to the best possible predictions, even with infinite data.
This research builds a new, solid foundation for training learning algorithms on imbalanced data. We went back to the drawing board and designed a new learning method from scratch, specifically for these situations. Our approach, called IMMAX (Imbalanced Margin Maximization), teaches the learning algorithm to be confident in its predictions for all classes, not just the common ones.
Crucially, we have proven mathematically that our method is reliable and will guide the learning algorithm toward the best possible performance. While our work is primarily theoretical, we also conducted experiments showing that algorithms based on our framework outperform existing methods in practice. This provides a more principled and effective way to build machine learning systems that can handle the "long tail" of rare but important events that are common in the real world.
Primary Area: General Machine Learning->Supervised Learning
Keywords: imbalanced data, consistency, margin bounds, learning theory
Submission Number: 7339