Keywords: inference, coannotation, labeling, uncertainty quantification, confidence intervals
Abstract: Obtaining high-quality labeled datasets is often costly, requiring either
human annotation or expensive experiments.
In principle, powerful pre-trained AI models offer a way to
label datasets automatically and save costs.
Unfortunately, these models come with no guarantees on their accuracy,
making wholesale replacement of manual labeling impractical.
In this work, we propose a method for leveraging pre-trained AI models to curate
cost-effective and high-quality datasets.
Our approach yields
*probably approximately correct labels*: with high probability, the overall
labeling error is small.
Our method is nonasymptotically valid under minimal assumptions on the dataset or
the AI model being studied, and thus enables rigorous yet efficient dataset
curation using modern AI models. We demonstrate the benefits of the methodology
through text annotation with large language models, image labeling with
pre-trained vision models, and protein folding analysis with AlphaFold.
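To make the "probably approximately correct labels" guarantee concrete, here is a minimal illustrative sketch, not the paper's actual method: given a small human-labeled calibration set with model confidence scores, pick a confidence threshold above which points are auto-labeled, certifying via a Hoeffding bound that the error rate among auto-labeled points is at most `eps` with probability at least `1 - delta`. The function name `pac_threshold` and the simple thresholding scheme are hypothetical; a rigorous version would also correct for scanning over many candidate thresholds.

```python
import numpy as np

def pac_threshold(conf_cal, err_cal, eps=0.05, delta=0.05):
    """Pick a confidence threshold t so that a Hoeffding upper confidence
    bound on the error rate of calibration points with confidence >= t
    stays below eps.

    conf_cal : model confidence on a human-labeled calibration set
    err_cal  : 1 if the model's label was wrong on that point, else 0
    Returns the lowest certifiable threshold, or None if no threshold
    certifies error <= eps at this calibration-set size.
    Note: a fully rigorous version would apply a multiplicity correction
    for scanning over all candidate thresholds; this is a simplified sketch.
    """
    order = np.argsort(-np.asarray(conf_cal))   # most confident first
    conf_sorted = np.asarray(conf_cal)[order]
    err_sorted = np.asarray(err_cal)[order]
    best = None
    for k in range(1, len(conf_sorted) + 1):
        mean_err = err_sorted[:k].mean()
        # Hoeffding: true error <= empirical error + sqrt(log(1/delta)/(2k))
        ucb = mean_err + np.sqrt(np.log(1.0 / delta) / (2 * k))
        if ucb <= eps:
            best = float(conf_sorted[k - 1])    # accept this lower threshold
    return best
```

If the calibration set is too small, or the tolerance `eps` too strict, no threshold can be certified and everything falls back to human labeling; this is the sense in which such procedures remain valid without assumptions on the AI model's accuracy.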
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 17773