TL;DR: unsupervised learning for class distribution mismatch
Abstract: Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance.
To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process.
Extensive experiments on three datasets demonstrate UCDM’s superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes, respectively.
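As an illustration of the confidence-based labeling idea mentioned in the abstract, the following is a minimal sketch of one round of threshold-based pseudo-label selection. It is a generic technique and not necessarily the mechanism UCDM uses: the function name `select_confident`, the threshold value, and the toy probabilities are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def select_confident(
    probs: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for confident predictions.

    `probs` holds one row of class probabilities per unlabeled sample.
    A sample is kept only if its top class probability reaches `threshold`.
    Both names are hypothetical; they are not taken from the UCDM code.
    """
    selected = []
    for i, p in enumerate(probs):
        # Index of the most probable class for this sample.
        label = max(range(len(p)), key=p.__getitem__)
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


# Toy classifier outputs for three unlabeled images (made-up numbers).
probs = [
    [0.97, 0.02, 0.01],  # confident: class 0
    [0.40, 0.35, 0.25],  # ambiguous: skipped
    [0.03, 0.01, 0.96],  # confident: class 2
]
picked = select_confident(probs, threshold=0.95)
```

In an iterative scheme, the selected samples would be added to the training set with their pseudo-labels, the classifier retrained, and the selection repeated on the remaining unlabeled data.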
Lay Summary: Machine learning models often assume that the training data has the same class distribution as the real-world data they’ll encounter, but this is rarely true. When the classes seen during training don’t match those in real-world tasks, models struggle, especially if labeled data is limited. Existing semi-supervised learning methods try to handle this mismatch using a small amount of labeled data, which limits their usefulness.
To solve this, we developed UCDM, a new method that doesn’t rely on labeled data. Instead, it learns by generating training examples from unlabeled images. Using a diffusion model, we add or remove visual content from images to create diverse pairs of examples. We also introduce a way to automatically assign pseudo-labels to real images and use them to train the model. Our approach outperforms previous methods on several challenging datasets, showing that learning from unlabeled, mismatched data is both possible and effective.
Link To Code: https://github.com/RUC-DWBI-ML/research/tree/main/UCDM-master
Primary Area: General Machine Learning->Unsupervised and Semi-supervised Learning
Keywords: unsupervised learning, class distribution mismatch, machine learning, diffusion model
Submission Number: 3112