The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators

Tzu-Heng Huang; Catherine Cao; Vaishnavi Bhargava; Frederic Sala

Published: 25 Sept 2024, Last Modified: 06 Nov 2024 | NeurIPS 2024 spotlight | CC BY 4.0

Keywords: Automated Data Labeling; Code Generation; Weak Supervision

TL;DR: We propose an alternative program distillation approach to replace expensive annotation processes that require repetitive prompting for labels.

Abstract: Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling the distillation of generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models with generating programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, $\textbf{Alchemist}$, achieves performance comparable to or better than large language model-based annotation on a range of tasks for a fraction of the cost: on average, performance improves by $\textbf{12.9}$%, while total labeling costs across all datasets fall by a factor of approximately $\textbf{500}\times$.
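To make the idea of "programs that can produce labels" concrete, here is a minimal sketch of how such programs might be stored and applied locally. It is an illustration only, not the Alchemist implementation: the spam-detection task, the three hand-written `lf_*` functions (standing in for model-generated programs), and the majority-vote aggregation are all assumptions chosen for brevity; the abstract does not specify how program outputs are combined.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a program may decline to label an example
LabelingProgram = Callable[[str], Optional[int]]

# Hypothetical programs of the kind a model could be asked to write once.
# Labels (assumed for this sketch): 1 = spam, 0 = not spam.
def lf_mentions_free(text: str) -> Optional[int]:
    return 1 if "free" in text.lower() else ABSTAIN

def lf_mentions_prize(text: str) -> Optional[int]:
    lowered = text.lower()
    return 1 if "prize" in lowered or "winner" in lowered else ABSTAIN

def lf_starts_with_greeting(text: str) -> Optional[int]:
    return 0 if text.lower().startswith(("hi", "hey", "hello")) else ABSTAIN

def majority_vote(votes: List[Optional[int]]) -> Optional[int]:
    """Combine program outputs; abstain when no program votes or on a tie."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def label_dataset(texts: List[str], programs: List[LabelingProgram]) -> List[Optional[int]]:
    """Apply every stored program to every example, with no API calls."""
    return [majority_vote([program(t) for program in programs]) for t in texts]

PROGRAMS = [lf_mentions_free, lf_mentions_prize, lf_starts_with_greeting]

if __name__ == "__main__":
    examples = [
        "Claim your FREE prize now",
        "hi, are we still on for lunch?",
        "Meeting moved to 3pm",
    ]
    print(label_dataset(examples, PROGRAMS))
```

Under these assumptions, the cost argument is that the model is queried once per program rather than once per example, and the resulting functions can be inspected, edited, or extended like any other code.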

Supplementary Material: zip

Primary Area: Other

Submission Number: 7027
