Distribution Aware Active Learning via Gaussian Mixtures

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Active Learning, Uncertainty Estimation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This work aims to mitigate overfitting in AL by reducing the distributional discrepancy between the labeled and unlabeled sets, using distributional information obtained from Gaussian Mixture Models (GMMs)
Abstract: In active learning (AL), the distribution of labeled samples in a latent space is often dissimilar to that of unlabeled samples, depending on factors such as the labeled set size or the data selection strategy. This distributional discrepancy hampers both the evaluation and the estimation of informativeness on unseen data, and remains an important issue in AL. In this paper, we propose a robust distribution-aware learning and sample selection strategy that employs a Gaussian Mixture Model (GMM) to effectively encapsulate both the labeled and unlabeled sets for AL. By utilizing GMM statistics derived from all available data, the proposed approach constructs a more diverse feature representation, thereby reducing the risk of overfitting to limited patterns. Specifically, we propose a regularization method that supervises GMM posteriors under the concept of metric learning, and introduce a semi-supervised learning method that feeds GMM statistics into an adversarial discriminator to prevent memorization of samples. Furthermore, we propose a new informativeness metric that utilizes GMM likelihoods to detect overfitted areas in the latent space, and then devise a hybrid sample selection strategy that takes advantage of the properties of different informativeness metrics. Extensive experimental results demonstrate that our GMM-based method outperforms existing works on various balanced and imbalanced datasets, and can be readily integrated with other AL schemes to further improve performance.
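The abstract's likelihood-based informativeness metric can be illustrated with a minimal sketch. This is NOT the paper's actual method (which fits the GMM over both labeled and unlabeled data and combines several metrics); it is an assumed, simplified variant that fits a GMM on labeled latent features only and flags unlabeled samples with low likelihood as lying in poorly covered regions of the latent space. All feature arrays, component counts, and the query budget below are hypothetical.

```python
# Illustrative sketch only: GMM-likelihood informativeness for active learning.
# Assumption: latent features are already extracted by some encoder.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical 8-d latent features for the labeled pool.
labeled_feats = rng.normal(0.0, 1.0, size=(200, 8))

# Hypothetical unlabeled pool: 150 in-distribution samples, plus 50 samples
# from a region the labeled set does not cover (shifted mean).
unlabeled_feats = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 8)),
    rng.normal(4.0, 1.0, size=(50, 8)),
])

# Fit a GMM on the labeled features to model the covered latent regions.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(labeled_feats)

# Low log-likelihood under the GMM marks unlabeled samples far from the
# labeled distribution; query the lowest-likelihood samples for annotation.
log_lik = gmm.score_samples(unlabeled_feats)
budget = 20
query_idx = np.argsort(log_lik)[:budget]
```

In this toy setup the queried indices concentrate in the shifted cluster (indices 150-199), matching the intuition that likelihood under the GMM can localize under-represented latent regions. The paper's full strategy additionally mixes this with other informativeness metrics in a hybrid selection scheme.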
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3306