Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios
Abstract: Pool-based Active Learning (AL) has proven successful in minimizing labeling costs by sequentially selecting the most informative unlabeled data from a large pool and querying their labels from an oracle or annotators. However, existing AL sampling schemes may not perform well in out-of-distribution (OOD) data scenarios, where the unlabeled data pool contains samples that do not belong to the pre-defined categories of the target task. Achieving strong AL performance under OOD data scenarios is challenging due to the inherent conflict between AL sampling strategies and OOD data detection: for instance, both highly informative in-distribution (ID) data and OOD data in an unlabeled pool are assigned high informativeness scores (e.g., high entropy) during the AL process. To address this dilemma, we propose a Monte-Carlo Pareto Optimization for Active Learning (POAL) sampling scheme, which selects optimal subsets of unlabeled samples with a fixed batch size from the unlabeled data pool. We formulate the AL sampling task as a multi-objective optimization problem and employ Pareto optimization based on two conflicting objectives: (1) the conventional AL sampling score (e.g., maximum entropy) and (2) the confidence of excluding OOD data samples. Experimental results demonstrate the effectiveness of our POAL approach on classical Machine Learning (ML) and Deep Learning (DL) tasks.
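To illustrate the bi-objective idea described in the abstract, the following is a minimal sketch (not the authors' actual POAL implementation, which uses Monte-Carlo subset optimization with a fixed batch size): each unlabeled sample is scored by an informativeness measure (e.g., entropy) and an assumed ID-confidence score, and only the Pareto non-dominated samples, those not outperformed on both objectives simultaneously, are kept as candidates.

```python
import numpy as np

def pareto_front(scores):
    """Return indices of non-dominated rows, maximizing both columns.

    scores: (n, 2) array; column 0 = informativeness (e.g., entropy),
    column 1 = confidence that the sample is in-distribution (ID).
    A row is dominated if another row is >= on both objectives
    and strictly > on at least one.
    """
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        dominated = np.all(scores >= scores[i], axis=1) & \
                    np.any(scores > scores[i], axis=1)
        if dominated.any():
            keep[i] = False
    return np.nonzero(keep)[0]

# Toy scores (entropy, ID confidence) for five unlabeled samples.
scores = np.array([
    [0.90, 0.20],  # very informative but likely OOD
    [0.80, 0.90],  # informative and likely ID
    [0.30, 0.95],  # confidently ID but less informative
    [0.20, 0.50],  # dominated by several others
    [0.85, 0.85],  # a good trade-off point
])
front = pareto_front(scores)
```

Here only the sample at index 3 is dominated, so the candidate front is {0, 1, 2, 4}; a batch-size-constrained selection would then be made from such non-dominated subsets.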
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: In Section 2.2, we have rewritten the related work section on distribution shift. In Section 3.3, we have expanded the discussion on the use of GMM for ID confidence score calculation, as well as the challenges of insufficient data. In Section 3.4.4, we have rewritten and highlighted the differences between POAL and POSS. In Section 4.2.1, we have added discussions on the running time for different values, the degree of conflict between multiple objectives, and the challenges of insufficient data at the initial stages of AL; we have also discussed the weighted-sum optimization strategy in comparison to other multi-objective optimization techniques, and added an experimental comparison with the pure ID-based baselines DDU and MAHA on the CIFAR10 and CIFAR100 datasets. We have added Appendix Section C.2.1, which includes enlarged figures to improve readability, and Appendix Section C.2.2, which presents a new visualization method using Gumbel-distributed random variables to better understand the density maps.
Supplementary Material: zip
Assigned Action Editor: ~Chicheng_Zhang1
Submission Number: 847