Abstract: Probabilistic next-token prediction trained with cross-entropy loss underlies most large language models: given a sequence of previous tokens, the model assigns a probability to each possible next token in the vocabulary. There are many ways to turn these next-token predictions into token sequences. This paper examines several such decoding algorithms (greedy/lookahead decoding, random sampling, and temperature-scaled random sampling) and studies their consistency with respect to the end goals of information retrieval and creative generation, encoding each goal as a loss function. Although the consistency of surrogate losses with respect to a target loss is a well-researched topic, to the best of our knowledge we are the first to study it in the context of LLMs. We find that, as long as next-token prediction converges to the true conditional distribution, random sampling is consistent with producing sequences that mimic samples from the true distribution. For other goals, such as minimizing the 0-1 loss on the entire sequence, we show that deterministic decoders have the edge over stochastic decoders. These results reveal a dichotomy between the goals of information retrieval and creative generation: choosing the decoding algorithm to match the desired goal is essential, and many commonly used algorithms lack theoretical grounding in numerous scenarios. While there has been empirical evidence for this, this paper provides rigorous theoretical grounding for these results.
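The contrast between the deterministic and stochastic decoders studied in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the logit values and function names are hypothetical, and temperature scaling is the standard softmax-with-temperature formulation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over next-token logits.
    # temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def greedy_decode(logits):
    # Deterministic decoder: always pick the argmax token.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_decode(logits, temperature=1.0, rng=random):
    # Stochastic decoder: sample a token index from the
    # (temperature-scaled) next-token distribution.
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

At temperature 1, `sample_decode` draws from the model's distribution directly (the setting the abstract identifies as consistent with mimicking the true distribution), while `greedy_decode` maximizes per-step probability, the behavior suited to 0-1-loss-style retrieval goals.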
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ruqi_Zhang1
Submission Number: 8217