Keywords: Active learning, Data synthesis, Machine learning, Influence function
Abstract: Active learning (AL) reduces data annotation costs by querying labels from human annotators for the most informative unlabeled data points during model training. Existing AL methods generally assume the availability of a large amount of unlabeled samples for query selection. However, collecting raw data in practice can be expensive, even without considering the cost of labeling. Membership query synthesis circumvents the need for an unlabeled data pool by directly generating informative queries from the input space. Nevertheless, existing approaches often generate instances lacking semantic meaning, thereby increasing the difficulty of labeling. In this paper, we propose the Generative Membership Query Descriptor (GenMQD) method for AL to mitigate the risk of generating unrecognizable instances. The key idea is to generate textual descriptions of the desired data, instead of the data samples themselves. Then a pre-trained multi-modal alignment model (e.g., CLIP) can be leveraged to transform these features into natural language texts for data gathering purposes. Extensive experiments on image classification benchmark datasets against query synthesis state-of-the-art methods demonstrate that, on average, GenMQD can improve model accuracy by 2.43\% when gathering and labeling 500 examples. A large-scale user study verifies that human oracles prefer GenMQD generated queries over generated image-based queries.
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6254
Loading