Keywords: Classification, Clustering, Subclass Imbalance, Preferential Attachment
Abstract: Classification models in machine learning are typically trained with coarse-grained class labels, which overlook fine-grained subclass variations. This phenomenon, known as hidden stratification [1], results in asymmetric performance: models excel on dominant subclasses but struggle on rare or underrepresented ones. Such biases critically undermine fairness and robustness, especially in safety-sensitive applications such as medical imaging. We introduce Subclass-Aware Inclusive Classification (SAIC), a framework shown in Figure 1 that explicitly addresses hidden stratification. SAIC operates in two stages: (i) unsupervised subclass identification using a repulsive point process (k-DPP [2]) to uncover diverse and representative latent subclasses without prior assumptions, and (ii) subclass-aware classification with Group Distributionally Robust Optimization (GDRO), which minimizes the worst-case subclass loss. Extensive experiments on four benchmark datasets (MNIST, CIFAR-10, Waterbirds, and CelebA) show that SAIC consistently improves robustness without compromising overall accuracy. Specifically, we compare against subclasses generated by K-means and GMM [3, 1], and also report the accuracy obtained with true subclass labels (Table 1). Beyond overall accuracy, SAIC's clustering module demonstrates superior subclass identification, closely matching true subclass counts, preserving rare subclass purity, and maintaining moderate runtime efficiency. SAIC provides a scalable solution to hidden stratification by combining diversity-aware subclass discovery with robust optimization, thereby enhancing fairness and reliability in high-stakes classification tasks.
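To make the GDRO objective in stage (ii) concrete, the sketch below shows the worst-case subclass loss: the maximum of the per-subclass mean losses, which training would then minimize. This is an illustrative sketch, not the authors' implementation; the function name, array shapes, and example values are assumptions, and the group assignments here stand in for the subclasses discovered in stage (i).

```python
import numpy as np

def group_dro_loss(per_sample_losses, group_ids):
    """Worst-case subclass loss used in GDRO-style training.

    per_sample_losses: per-example losses, shape (n,)
    group_ids: subclass assignment for each example, shape (n,)
    Returns the maximum over subclasses of the mean loss within
    that subclass, so rare, poorly-fit subclasses dominate the
    objective instead of being averaged away.
    """
    losses = np.asarray(per_sample_losses, dtype=float)
    groups = np.asarray(group_ids)
    group_means = [losses[groups == g].mean() for g in np.unique(groups)]
    return max(group_means)

# Hypothetical example: a dominant subclass (0) with low loss and a
# rare subclass (1) with high loss. Plain averaging would report 0.325,
# hiding the rare subclass; the worst-group loss reports 0.9.
losses = [0.1, 0.2, 0.1, 0.9]
groups = [0, 0, 0, 1]
print(group_dro_loss(losses, groups))  # 0.9
```

Minimizing this maximum, rather than the overall mean, is what lets the model's performance on the rarest discovered subclass drive training.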
Submission Number: 386