Model Behavior and Predictive Stability Under Severe Class Imbalance in High-Dimensional Classification

Published: 29 May 2026, Last Modified: 08 Jun 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: class imbalance, high dimensional classification, scRNA-seq, imbalanced learning, deep learning, machine learning, simulation study
TL;DR: Severe class imbalance produces substantially different minority class recovery and predictive stability patterns across classification models in high-dimensional settings.
Abstract: Class imbalance remains a major challenge in high-dimensional biological classification problems, particularly in single-cell RNA sequencing (scRNA-seq), where rare cell populations are often underrepresented. In this study, we evaluate the behavior of four classification models, namely Elastic Net, Random Forest, Extreme Gradient Boosting (XGBoost), and a multilayer perceptron (MLP), under varying imbalance and sample size conditions using a two-factor simulation framework derived from the Human Lung Cell Atlas (HLCA). Simulation settings span five minority class proportions and four sample size levels, with repeated train-test splits used to evaluate predictive performance. Under severe imbalance, minority class recovery and predictive stability differ substantially across model architectures. Elastic Net shows unstable minority class recovery under extreme imbalance, including non-monotonic recall at intermediate imbalance levels. In contrast, the MLP recovers fastest as sample size and minority class proportion increase, while XGBoost maintains comparatively stable performance across imbalance regimes. Random Forest improves more gradually as sample size increases. These results suggest that the effect of class imbalance depends on model architecture and substantially influences predictive stability and minority class recovery in high-dimensional classification settings.
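The two-factor design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes a binary label vector (1 = minority), a fully crossed grid, and stratified splits. The grid values, the function names (`subsample`, `stratified_split`, `minority_recall`, `run_grid`), the number of repeats, and the test fraction are all hypothetical, since the abstract states only that there are five proportions and four sample sizes. The `fit_predict` callback stands in for any of the four models.

```python
import random
from itertools import product

# Hypothetical levels: the abstract gives only the counts (5 x 4), not the values.
MINORITY_PROPS = [0.01, 0.02, 0.05, 0.10, 0.20]
SAMPLE_SIZES = [500, 1000, 2000, 5000]


def subsample(labels, n, minority_prop, rng):
    """Draw n indices so that a fraction minority_prop of them has label 1."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_min = max(1, round(n * minority_prop))
    n_maj = n - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("pool too small for the requested design cell")
    return rng.sample(minority, n_min) + rng.sample(majority, n_maj)


def stratified_split(indices, labels, test_frac, rng):
    """Split indices into train/test while preserving the class ratio."""
    train, test = [], []
    for cls in (0, 1):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_frac))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def minority_recall(y_true, y_pred):
    """Recall on class 1: true positives divided by all actual positives."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else float("nan")


def run_grid(labels, fit_predict, n_repeats=10, test_frac=0.3, seed=0):
    """Return {(minority_prop, n): [minority recall per repeat]}."""
    rng = random.Random(seed)
    results = {}
    for prop, n in product(MINORITY_PROPS, SAMPLE_SIZES):
        recalls = []
        for _ in range(n_repeats):
            idx = subsample(labels, n, prop, rng)
            train, test = stratified_split(idx, labels, test_frac, rng)
            # fit_predict(train, test) must return one 0/1 prediction per test index.
            y_pred = fit_predict(train, test)
            recalls.append(minority_recall([labels[i] for i in test], y_pred))
        results[(prop, n)] = recalls
    return results
```

Each design cell yields one list of per-split recall values. The mean of that list describes minority class recovery, and its spread across repeats is one possible proxy for predictive stability; the abstract does not say which stability measure the study uses.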
Submission Number: 44