Towards Coding Social Science Datasets with Language Models

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Researchers often rely on humans to code (label, annotate, etc.) large sets of texts. This task is highly variable and demands substantial time and resources. Efforts to automate this process have achieved human-level accuracy in some cases, but they often rely on thousands of hand-labeled training examples, which makes them inapplicable to small-scale research studies and still costly for large ones. At the same time, it is well known that language models can classify text; in this work, we use OpenAI's GPT-3 as a synthetic coder and explore what classic methodologies and metrics (such as intercoder reliability) look like in this new context. We find that GPT-3 matches the performance of typical human coders and frequently outperforms humans in terms of intercoder agreement across a variety of social science tasks, suggesting that language models could be a useful tool for the social sciences.
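The abstract's central metric, intercoder agreement, is typically chance-corrected; one common choice is Cohen's kappa between two coders (e.g., a human and the model used as a synthetic coder). The sketch below is illustrative only and is not taken from the paper; the function name and example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' label sequences.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the agreement expected if each coder labeled
    independently according to their own label frequencies.
    """
    assert len(coder_a) == len(coder_b) and coder_a, "need equal, non-empty codings"
    n = len(coder_a)
    # Observed proportion of items the two coders label identically.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: a human coder vs. a model coder on four items.
human = ["pos", "pos", "neg", "neg"]
model = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(human, model))  # 0.5: 75% raw agreement, 50% expected by chance
```

Raw percent agreement alone can look high simply because one label dominates; kappa's chance correction is why it is a standard reporting choice in content-analysis work like that described above.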