Supervised Text Classification with LLM-Generated Training Labels

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: The paper explores the viability of using synthetic training labels generated by GPT-4 to fine-tune supervised classifiers, finding that models tuned with synthetic labels perform comparably to those tuned with human annotations.
Abstract: Computational social science practitioners often rely on human-annotated data to train supervised classifiers for text annotation. We assess the potential for researchers to augment or replace human-generated training data with synthetic training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent computational social science articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers trained with human annotations and against GPT-4 few-shot labels. Our findings indicate that supervised classification models trained on LLM-generated labels perform comparably to models trained with labels from human annotators. Training models using LLM-generated labels is a fast, efficient, and cost-effective method of building supervised text classifiers.
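The core workflow the abstract describes (generate labels with GPT-4, then train a conventional supervised classifier on them) can be sketched briefly. The snippet below is a minimal illustration, not the paper's actual pipeline: the prompt wording, the binary label set, and the TF-IDF plus logistic-regression classifier are all assumptions made for brevity.

```python
# Minimal sketch of a label-then-train workflow: GPT-4 produces synthetic
# labels, and a cheap supervised classifier is trained on them. The prompt,
# label set, and classifier are illustrative assumptions, not the paper's
# actual configuration.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

client = OpenAI()  # requires OPENAI_API_KEY in the environment

LABELS = ["positive", "negative"]  # hypothetical binary task

def gpt4_label(text: str) -> str:
    """Ask GPT-4 for a single label; fall back to the first label on a miss."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Classify the text as one of: {', '.join(LABELS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else LABELS[0]

# Unlabeled training pool; a real application would label hundreds of
# documents rather than this toy hand-picked set.
train_texts = [
    "I loved this film.",
    "An absolute delight from start to finish.",
    "Terrible service, never again.",
    "The product broke within a week.",
]
synthetic_labels = [gpt4_label(t) for t in train_texts]

# Train a supervised classifier on the synthetic labels alone.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, synthetic_labels)

print(clf.predict(["Highly recommended."]))
```

The same pattern generalizes to the fine-tuned classifiers the paper studies: only the labeling step uses the LLM, so inference cost is paid once at training time rather than on every prediction.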
Paper Type: short
Research Area: Computational Social Science and Cultural Analytics
Contribution Types: NLP engineering experiment, Reproduction study, Approaches to low-resource settings
Languages Studied: English