A test of stochastic parroting in a generalisation task: predicting the characters in TV series

15 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY 4.0
Keywords: stochastic parrot, pca, nlp, large language model
TL;DR: A test of stochastic parroting in a generalisation task: predicting the characters in TV series
Abstract: There are two broad, opposing views of the recent developments in large language models (LLMs). The first uses the term "stochastic parrots" from Emily Bender et al. ("On the dangers of stochastic parrots: Can language models be too big?", 2021) to emphasise that, because LLMs are simply a method for creating a probability distribution over sequences of words, they can be viewed as merely parroting information in the training data. The second view, "Sparks of AGI" from Sébastien Bubeck et al. ("Sparks of artificial general intelligence: Early experiments with GPT-4", 2023), posits that the unprecedented scale of computation in the newest generation of LLMs is leading to what its proponents call "an early (yet still incomplete) version of an artificial general intelligence (AGI) system". In this article, we propose a method for making predictions purely from the representation of data inside the LLM. Specifically, we create a logistic regression model, using the principal components of an LLM embedding as features, to predict an output variable. The task we use to illustrate our method is predicting the characters in TV series, based on their lines in the show. We show that our method can, for example, distinguish Penny and Sheldon in The Big Bang Theory with an AUC of 0.79. Logistic regression models for other characters in The Big Bang Theory have lower AUC values (ranging from 0.59 to 0.79), with the most significant distinguishing factors between characters relating to the number and nature of comments they make about women. The characters in the TV series Friends are more difficult to distinguish using this method (AUCs range from 0.61 to 0.66). We find that the accuracy of our logistic regression on a linear feature space is slightly (around 3 percentage points) lower than that of GPT-4, which is in turn around 5 percentage points below that of a human expert.
We discuss how the method we propose could be used to help researchers be more specific in the claims they make about large language models.
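The pipeline described in the abstract (LLM embeddings, reduced to principal components, fed to a logistic regression, evaluated with AUC) can be sketched as follows. This is a minimal illustration, not the authors' code: the embeddings here are synthetic stand-ins with an injected class signal, whereas in the paper each dialogue line would be embedded by an LLM and labelled with its speaker (e.g. Penny vs. Sheldon).

```python
# Sketch of the abstract's method: PCA on (here, synthetic) LLM embeddings
# as features for a logistic regression, scored with AUC.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 400 lines of dialogue, each represented by a 768-dim
# embedding, labelled 0 or 1 for the two characters being distinguished.
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 768)) + 0.5 * y[:, None]  # weak per-class shift

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Reduce the embedding space to its leading principal components...
pca = PCA(n_components=20).fit(X_tr)
# ...and fit a logistic regression on those components.
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_tr), y_tr)

# AUC on held-out lines, analogous to the per-character-pair scores reported.
auc = roc_auc_score(y_te, clf.predict_proba(pca.transform(X_te))[:, 1])
print(f"AUC: {auc:.2f}")
```

Because the features are linear (principal components), the fitted coefficients can be inspected to see which components, and hence which directions in embedding space, most separate the two characters.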
Supplementary Material: zip
Primary Area: Natural language processing
Submission Number: 17801