GAN-Based Probabilistic Sampling for Biomedical Data Augmentation: A Comparative Study on Severely Imbalanced Single-Cell Classification
Keywords: Class imbalance, Generative Adversarial Networks, SMOTE, single-cell RNA-seq, data augmentation, rare cell types, mode collapse
TL;DR: For severely imbalanced scRNA-seq data, data augmentation outperforms reweighting, with VAEs achieving the best rare cell classification performance, while GANs and weighted losses underperform at extreme imbalance.
Abstract: Single-cell RNA sequencing (scRNA-seq) datasets frequently exhibit severe class imbalance,
wherein rare cell types constitute less than 0.01% of the total cellular population,
thereby presenting substantial challenges for supervised classification methodologies. Traditional
resampling techniques prove inadequate in generating biologically meaningful synthetic
samples for extremely rare cell populations, while weighted loss functions demonstrate
a tendency toward overcompensation, resulting in elevated false-positive rates. In
this study, we conduct a comprehensive evaluation of nine methods for classifying severely
imbalanced scRNA-seq data. We introduce and compare two deep generative models, a
Generative Adversarial Network (GAN) and a Variational Autoencoder (VAE), against a
suite of standard techniques: original imbalanced training, Synthetic Minority Over-sampling
Technique (SMOTE), random undersampling, random oversampling, two forms of weighted
cross-entropy (Balanced and Effective Number), and a novel Hybrid approach combining
active learning with SMOTE. Using a synthetic dataset designed to mimic the extreme
imbalance of scRNA-seq data (up to 143:1 ratio), our analysis revealed that data-level augmentation
methods, particularly deep generative models, significantly outperform algorithm-level
adjustments. The **Variational Autoencoder (VAE)** achieved the highest Macro
F1 score (20.9%), demonstrating its superior ability to model and generate synthetic samples
for rare classes. SMOTE also performed competitively (19.2%), confirming the utility
of interpolation-based methods. In contrast, both weighted cross-entropy methods (18.8%
and 17.6%) and the specialized Hybrid method (13.4%) underperformed, suggesting that
for this data distribution, generating new data is more effective than re-weighting existing
samples or employing complex active learning pipelines. Our findings indicate that for high-dimensional,
severely imbalanced data, generative models like VAEs provide a more robust
and effective solution than traditional resampling or cost-sensitive learning. These results
provide evidence-based guidelines for method selection contingent upon class sample sizes
and demonstrate that GAN-based augmentation necessitates substantially greater minority
class representation than typically available in rare cell type investigations.
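
The two weighted cross-entropy baselines named above can be illustrated with a minimal sketch. It assumes that "Balanced" refers to the common inverse-frequency rule (total samples divided by the number of classes times the class count) and that "Effective Number" refers to the class-balanced weighting based on the effective number of samples. The abstract does not state the exact formulas used, so both are assumptions. The class counts and the `beta` value are hypothetical, chosen only to reproduce a 143:1 imbalance ratio.

```python
import numpy as np

def balanced_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * count_c). Assumed definition."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def effective_number_weights(counts, beta=0.999):
    """Weights proportional to 1 / E_c with E_c = (1 - beta^n_c) / (1 - beta).

    Assumed definition; weights are rescaled to sum to the number of classes.
    """
    counts = np.asarray(counts, dtype=float)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(counts) / weights.sum()

# Hypothetical class counts with a 143:1 ratio between the largest and smallest class.
counts = [1430, 100, 10]
w_bal = balanced_weights(counts)
w_eff = effective_number_weights(counts)

print("balanced:        ", np.round(w_bal, 3))
print("effective number:", np.round(w_eff, 3))
```

Under these assumptions, the balanced rule weights the rarest class 143 times more heavily than the most common one, while the effective-number rule yields a smaller ratio because the effective count saturates for large classes. Either weight vector could be passed as per-class weights to a cross-entropy loss.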
Submission Number: 20