Revisiting Active Learning in the Era of Vision Foundation Models

TMLR Paper 2100 Authors

25 Jan 2024 (modified: 19 May 2024) · Decision pending for TMLR
Abstract: Foundation vision and vision-language models are trained on large-scale unlabeled or noisy data and learn robust representations that achieve impressive zero- or few-shot performance on diverse tasks. Given these properties, they are a natural fit for \textit{active learning} (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, particularly in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL, namely, 1) initial labeled pool selection, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a simple and elegant new AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. Source code will be made available.
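To make the abstract's idea concrete, below is a minimal sketch of the general pattern it describes: Monte Carlo dropout on frozen foundation-model features to estimate uncertainty, combined with cluster-based selection for diversity. This is an illustrative assumption, not the authors' DropQuery implementation; all function names, the k-means diversity step, and hyperparameters (`n_passes`, `p`, the 768-d feature size) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's DropQuery): score unlabeled
# samples by how often dropout-perturbed predictions flip relative to the
# deterministic prediction, then spread the query batch across k-means
# clusters of the feature space for diversity.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


def mc_dropout_disagreement(classifier: nn.Module, feats: torch.Tensor,
                            n_passes: int = 10, p: float = 0.5) -> np.ndarray:
    """Fraction of dropout passes whose prediction flips per sample."""
    classifier.eval()
    with torch.no_grad():
        base_pred = classifier(feats).argmax(dim=1)
        drop = nn.Dropout(p)
        drop.train()  # keep dropout active at inference time
        flips = torch.zeros(len(feats))
        for _ in range(n_passes):
            pred = classifier(drop(feats)).argmax(dim=1)
            flips += (pred != base_pred).float()
    return (flips / n_passes).numpy()


def select_query_batch(feats: np.ndarray, scores: np.ndarray,
                       budget: int, seed: int = 0) -> np.ndarray:
    """Pick the highest-uncertainty sample from each of `budget` clusters."""
    clusters = KMeans(n_clusters=budget, n_init=10,
                      random_state=seed).fit_predict(feats)
    picks = []
    for c in range(budget):
        idx = np.flatnonzero(clusters == c)
        picks.append(idx[scores[idx].argmax()])
    return np.array(picks)


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(1000, 768)   # stand-in for frozen DINOv2-style features
    clf = nn.Linear(768, 10)         # linear probe; would be trained on the labeled pool
    scores = mc_dropout_disagreement(clf, feats)
    batch = select_query_batch(feats.numpy(), scores, budget=20)
    print("query indices:", batch)
```

In a real AL loop, the linear probe would first be fit on the current labeled pool, and this scoring/selection step would run once per acquisition round to choose the next `budget` samples to label.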
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- **Abstract:** "... efficiency, but ..." $\rightarrow$ "... efficiency. However ..."
- **Introduction:** The paragraph describing the key contributions of this paper has been edited for clarity in response to reviewer Zp5y.
- **Section 3:** The description of the budget size $B$ has been edited to clearly indicate that class labels are not known at query time.
- **Section 3.1:** In response to reviewer LkyM, we have revised our claims and added a more nuanced comparison of our results with those from Chandra et al.
- **Section 3.5:** The summary of experimental results (Sections 3.1-3.4) has been moved to a new section (3.5) to more effectively convey our key findings.
- **Section 4:** The motivation for the construction of DropQuery has been clarified, with a detailed breakdown of how each component is constructed based on the results in Section 3, in response to reviewer LkyM.
- **Section 5:** We have moved the ablations for DropQuery from the supplementary materials to the main text in Section 5.4 in response to reviewers Zp5y and sEZP.
- **Section 6:** In response to reviewer LkyM, we have added a note on long-tailed datasets. We have also acknowledged that our recommendation on semi-supervised learning is based on label propagation and that this is not the only approach.
Assigned Action Editor: ~Zhiding_Yu1
Submission Number: 2100