Language Models Are Better Than Humans at Next-token Prediction

Published: 15 Jul 2024 · Last Modified: 17 Sept 2024 · Accepted by TMLR · License: CC BY 4.0
Abstract: Current language models are considered to have sub-human capabilities at natural language tasks like question answering or writing code. However, causal language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To investigate this question, we performed two experiments directly comparing humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity on OpenWebText. In both experiments, we find humans to be consistently \emph{worse} than relatively small language models like GPT-Neo-1.3B or GPT-2-large at next-token prediction.
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/FabienRoger/lm-game-analysis-main
Supplementary Material: zip
Assigned Action Editor: ~W_Ronny_Huang1
Submission Number: 2191
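For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch (not the authors' released code; see the repository linked above) of how top-1 next-token accuracy and perplexity can be computed for a Hugging Face causal language model. The model name and sample sentence are illustrative assumptions; the paper evaluates on OpenWebText and also uses GPT-Neo-1.3B.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"  # assumption; the paper also evaluates EleutherAI/gpt-neo-1.3B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder text; the paper draws passages from OpenWebText.
text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, seq_len)

with torch.no_grad():
    logits = model(ids).logits  # shape (1, seq_len, vocab_size)

# The logits at position t are the model's prediction for the token at t + 1.
preds = logits[:, :-1, :]
targets = ids[:, 1:]

# Top-1 accuracy: fraction of positions where the argmax token is correct.
top1_acc = (preds.argmax(dim=-1) == targets).float().mean().item()

# Perplexity: exp of the mean cross-entropy over all predicted tokens.
loss = torch.nn.functional.cross_entropy(
    preds.reshape(-1, preds.size(-1)), targets.reshape(-1)
)
perplexity = torch.exp(loss).item()

print(f"top-1 accuracy: {top1_acc:.3f}, perplexity: {perplexity:.1f}")
```

Comparing humans on the same footing requires eliciting either a single guess per token (for top-1 accuracy) or a probability over candidate tokens (for perplexity), which is what the paper's two experiments do.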