Language Models Are Better Than Humans at Next-token Prediction

TMLR Paper 2191 Authors

13 Feb 2024 (modified: 28 Mar 2024) · Under review for TMLR
Abstract: Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, causal language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity on OpenWebText. In both experiments, we find humans to be consistently \emph{worse} than relatively small language models like GPT-Neo-1.3B or GPT-2-large at next-token prediction.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~W_Ronny_Huang1
Submission Number: 2191