Abstract: Foundation vision or vision-language models are trained on large unlabeled or noisy datasets and learn robust representations that achieve impressive zero- or few-shot performance on diverse tasks. These properties make them a natural fit for _active learning_ (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, particularly in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL: 1) selection of the initial labeled pool, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new, simple, and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. We also provide a highly performant and efficient implementation of modern AL strategies (including our method) at https://github.com/sanketx/AL-foundation-models.
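The kind of acquisition step described in the abstract, balancing dropout-based uncertainty against sample diversity over frozen foundation-model features, can be illustrated with the minimal sketch below. This is not the paper's implementation (see the linked repository for that); the names `head`, `features`, `budget`, and `passes` are hypothetical placeholders, and the clustering-plus-entropy combination is one assumed way to trade off the two criteria.

```python
# Illustrative sketch (assumptions noted above): one AL acquisition round that
# combines MC-dropout uncertainty from a small classification head on frozen
# foundation-model features with k-means-based diversity over those features.
import numpy as np
import torch
from sklearn.cluster import KMeans


def mc_dropout_entropy(head: torch.nn.Module, features: torch.Tensor, passes: int = 10) -> np.ndarray:
    """Predictive entropy of the mean prediction over stochastic forward passes with dropout active."""
    head.train()  # keep dropout layers stochastic during inference
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(head(features), dim=-1) for _ in range(passes)]
        ).mean(0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return entropy.cpu().numpy()


def select_queries(head: torch.nn.Module, features: torch.Tensor, budget: int, passes: int = 10) -> np.ndarray:
    """Pick `budget` unlabeled indices: cluster features for diversity, then take the
    most uncertain sample (highest MC-dropout entropy) from each cluster."""
    entropy = mc_dropout_entropy(head, features, passes)
    clusters = KMeans(n_clusters=budget, n_init=10).fit_predict(features.cpu().numpy())
    picks = []
    for c in range(budget):
        idx = np.where(clusters == c)[0]
        picks.append(idx[np.argmax(entropy[idx])])
    return np.array(picks)
```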
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the action editor's comments, we have made minor revisions.
- **Writing and structure:** Overall writing has been polished and checked for typos and errors.
- **MAE experiments:** We have included the requested MAE backbone experiments in Supplementary Section A.4.
Code: https://github.com/sanketx/AL-foundation-models
Supplementary Material: pdf
Assigned Action Editor: ~Zhiding_Yu1
Submission Number: 2100