GAN-Based Probabilistic Sampling for Biomedical Data Augmentation: A Comparative Study on Severely Imbalanced Single-Cell Classification

Published: 20 Dec 2025, Last Modified: 20 Dec 2025, SPARTA_AAAI2026 Poster, CC BY 4.0
Keywords: Class imbalance, Generative Adversarial Networks, SMOTE, single-cell RNAseq, data augmentation, rare cell types, mode collapse
TL;DR: For severely imbalanced scRNA-seq data, data augmentation outperforms reweighting, with VAEs achieving the best rare cell classification performance, while GANs and weighted losses underperform at extreme imbalance.
Abstract: Single-cell RNA sequencing (scRNA-seq) datasets frequently exhibit severe class imbalance, wherein rare cell types constitute less than 0.01% of the total cellular population, thereby presenting substantial challenges for supervised classification methodologies. Traditional resampling techniques prove inadequate in generating biologically meaningful synthetic samples for extremely rare cell populations, while weighted loss functions demonstrate a tendency toward overcompensation, resulting in elevated false-positive rates. In this study, we conduct a comprehensive evaluation of nine methods for classifying severely imbalanced scRNA-seq data. We introduce and compare two deep generative models, a Generative Adversarial Network (GAN) and a Variational Autoencoder (VAE), against a suite of standard techniques: original imbalanced training, Synthetic Minority Over-sampling Technique (SMOTE), random undersampling, random oversampling, two forms of weighted cross-entropy (Balanced and Effective Number), and a novel Hybrid approach combining active learning with SMOTE. Using a synthetic dataset designed to mimic the extreme imbalance of scRNA-seq data (up to 143:1 ratio), our analysis revealed that data-level augmentation methods, particularly deep generative models, significantly outperform algorithm-level adjustments. The **Variational Autoencoder (VAE)** achieved the highest Macro F1 score (20.9%), demonstrating its superior ability to model and generate synthetic samples for rare classes. SMOTE also performed competitively (19.2%), confirming the utility of interpolation-based methods. In contrast, both weighted cross-entropy methods (18.8% and 17.6%) and the specialized Hybrid method (13.4%) underperformed, suggesting that for this data distribution, generating new data is more effective than re-weighting existing samples or employing complex active learning pipelines.
Our findings indicate that for high-dimensional, severely imbalanced data, generative models like VAEs provide a more robust and effective solution than traditional resampling or cost-sensitive learning. These results provide evidence-based guidelines for method selection contingent upon class sample sizes and demonstrate that GAN-based augmentation necessitates substantially greater minority class representation than typically available in rare cell type investigations.
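
As a rough illustration of the two weighted cross-entropy baselines named in the abstract, the sketch below computes per-class loss weights under two common definitions. The abstract does not give the exact formulas, so both are assumptions: "Balanced" is taken to mean inverse-frequency weights N / (K * n_c), and "Effective Number" is taken to mean weights proportional to (1 - beta) / (1 - beta^n_c). The class counts and the value of `beta` are hypothetical, chosen only to reproduce a 143:1 imbalance ratio.

```python
import numpy as np

def balanced_weights(counts):
    """Inverse-frequency weights: N / (K * n_c). Assumed definition of 'Balanced'."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def effective_number_weights(counts, beta=0.999):
    """Weights proportional to 1 / E_n, where E_n = (1 - beta^n) / (1 - beta).

    Assumed definition of 'Effective Number'; beta is a hypothetical choice.
    The weights are rescaled so that they sum to the number of classes.
    """
    counts = np.asarray(counts, dtype=float)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(counts) / weights.sum()

# Hypothetical class counts with a 143:1 ratio between largest and smallest class.
counts = [1430, 200, 10]
w_bal = balanced_weights(counts)
w_eff = effective_number_weights(counts)

print("balanced:        ", w_bal)
print("effective number:", w_eff)
```

Either weight vector could then be passed to a weighted cross-entropy loss. For these hypothetical counts, the effective-number weights are less extreme than the balanced ones, because the effective number grows more slowly than the raw count for large classes.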
Submission Number: 20