Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Published: 10 Jul 2024, Last Modified: 26 Aug 2024 · COLM · CC BY 4.0
Research Area: Data, Science of LMs, Compute efficient LMs, Engineering for large LMs, Inference algorithms for LMs
Keywords: infini-gram, n-gram, language model, suffix array
TL;DR: We built the largest-ever n-gram LM, trained on trillions of tokens with unbounded n, developed a method to efficiently train and serve it, and showed its utility in this era of neural LLMs.
Abstract: Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is *yes*, and we showcase their value in both text analysis and improving neural LLMs. We do so by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- **5 trillion tokens**. This is one of the largest $n$-gram LMs ever built. Second, existing $n$-gram LMs use small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new **$\infty$-gram LM** with backoff. Instead of pre-computing $n$-gram count tables (which would be prohibitively expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with **millisecond-level latency**. The $\infty$-gram framework and the infini-gram engine enable many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%) and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the agreement level between the machine-generated text and the $\infty$-gram with respect to suffix length, which indicate deficiencies in neural LLM pretraining and in the positional embeddings of Transformers.
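
To make the $\infty$-gram definition concrete, below is a minimal Python sketch of the idea described in the abstract (a toy illustration, not the authors' engine): back off to the longest suffix of the context that occurs in the corpus, locate its occurrences by binary search over a suffix array, and return the empirical next-token distribution at those positions. The function names (`build_suffix_array`, `count_occurrences`, `infgram_next_token_distribution`) are illustrative only.

```python
# Minimal, in-memory sketch of infinity-gram next-token estimation with a
# suffix array. Illustration of the idea only, not the infini-gram engine,
# which works over on-disk suffix arrays of trillions of tokens.
# Requires Python 3.10+ (key= argument of bisect).
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    # Toy O(n^2 log n) construction; fine for a demo-sized corpus.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens, sa, pattern):
    # Binary-search the suffix array for the range of suffixes that start
    # with `pattern`; the suffix array is sorted, so truncated comparisons
    # preserve the order.
    key = lambda i: tokens[i:i + len(pattern)]
    lo = bisect_left(sa, pattern, key=key)
    hi = bisect_right(sa, pattern, key=key)
    return sa[lo:hi]  # positions in `tokens` where `pattern` occurs


def infgram_next_token_distribution(tokens, sa, context):
    # Back off to the longest suffix of `context` that occurs in the corpus
    # (down to the empty suffix, i.e. the unigram distribution), then read
    # off the empirical next-token distribution at the matched positions.
    for start in range(len(context) + 1):
        suffix = context[start:]
        positions = count_occurrences(tokens, sa, suffix)
        counts = {}
        for pos in positions:
            nxt = pos + len(suffix)
            if nxt < len(tokens):
                counts[tokens[nxt]] = counts.get(tokens[nxt], 0) + 1
        if counts:
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}  # only reached if the corpus is empty


# Example: the longest matched suffix of the context is "on the", so the
# effective n here is 3.
corpus = "the cat sat on the mat . the cat sat on the hat .".split()
sa = build_suffix_array(corpus)
print(infgram_next_token_distribution(corpus, sa, "she stood on the".split()))
# {'mat': 0.5, 'hat': 0.5}
```

The sketch only mirrors the probability definition; the actual infini-gram engine reaches millisecond-level latency on trillions of tokens through its suffix-array-based implementation, as described in the abstract.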
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 59