TL;DR: We introduce a novel extension of the multi-armed bandit problem with applications to LLM decoding and decoding-time alignment.
Abstract: We introduce the tokenized linear bandit (TLB) and tokenized multi-armed bandit (TMAB), variants of the linear and stochastic multi-armed bandit problems inspired by LLM decoding and alignment. In these problems, at each round $t \in [T]$, a user submits a query (context), and the decision maker (DM) sequentially and irrevocably selects tokens from a token set. Once the sequence is complete, the DM observes a random utility from the user, whose expectation is represented by a sequence function mapping the chosen token sequence to a nonnegative real value that depends on the query.
In both problems, we first show that learning is impossible without any structure on the sequence function.
We introduce a natural assumption, diminishing distance with more commons (DDMC), and propose algorithms with regret $\tilde{O}(L\sqrt{T})$ and $\tilde{O}(L\sqrt{T^{2/3}})$ for TLB and TMAB, respectively.
As a byproduct, we obtain the (almost) optimality of the greedy decoding algorithm for LLM decoding under DDMC, which justifies the unreasonable effectiveness of greedy decoding in several tasks.
This also has an immediate application to decoding-time LLM alignment when the misaligned utility can be represented as the combination of the frozen LLM's utility and a linearly realizable latent function.
Finally, we empirically validate our algorithms' performance and verify our assumptions using synthetic and real-world datasets.
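To make the interaction protocol in the abstract concrete, here is a minimal Python sketch of one round of the tokenized bandit; the token set, the utility oracle, and the policy below are hypothetical toy stand-ins for illustration, not the paper's actual setup or implementation.

```python
import numpy as np

# Hypothetical tokenized-bandit round: the decision maker (DM) receives a
# query, builds a response token by token (each choice is irrevocable), and
# only after the sequence is complete observes a noisy utility whose mean is
# the (unknown) query-dependent sequence function.

rng = np.random.default_rng(0)
TOKENS = ["good", "bad", "maybe", "<eos>"]   # toy token set
MAX_LEN = 5                                  # response length cap L


def sequence_utility_mean(query, seq):
    """Mean utility of a full sequence (unknown to the DM; toy stand-in)."""
    return sum(1.0 for tok in seq if tok == "good") / MAX_LEN


def play_round(query, select_token):
    """One round t: sequentially pick tokens, then receive bandit feedback."""
    seq = []
    for _ in range(MAX_LEN):
        tok = select_token(query, seq)       # DM's decoding policy
        seq.append(tok)                      # irrevocable token choice
        if tok == "<eos>":
            break
    mean = sequence_utility_mean(query, seq)
    reward = mean + rng.normal(scale=0.1)    # noisy utility feedback
    return seq, reward


# Example: a uniformly random decoding policy, purely for illustration.
seq, reward = play_round(
    "Is this movie good?",
    lambda q, s: TOKENS[rng.integers(len(TOKENS))],
)
print(seq, reward)
```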
Lay Summary: Large language models (LLMs) like ChatGPT generate responses one word (or token) at a time. But how should they choose the next word in a way that aligns best with what the user wants? This paper introduces a new mathematical framework to study this problem using ideas from a field called multi-armed bandits, which is often used to model decision-making under uncertainty.
In our problem setting, a user submits a question, and the system chooses one word at a time to form a complete response. After the response is finished, the system receives feedback — a score measuring how good the response was. The challenge is to learn how to pick better responses over time.
We show that without any structure, learning is hopeless. But with a natural assumption (that similar tokens lead to similar outcomes), we develop new algorithms that learn effectively and provide strong performance guarantees. Surprisingly, our results also explain why simple decoding methods like greedy generation (choosing the best word at each step) often work well in practice. Our findings are supported by experiments using both synthetic and real-world data.
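To illustrate "choosing the best word at each step," below is a minimal greedy-decoding sketch; the next-token scoring function is a hypothetical placeholder for a language model's per-token scores, not the model used in the experiments.

```python
# Minimal greedy decoding: at each step, append the single highest-scoring
# next token according to the (hypothetical) per-token scores.

def greedy_decode(score_next_token, query, vocab, max_len=20, eos="<eos>"):
    response = []
    for _ in range(max_len):
        scores = score_next_token(query, response)  # dict: token -> score
        best = max(vocab, key=lambda tok: scores[tok])
        response.append(best)
        if best == eos:
            break
    return response
```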
Primary Area: Theory->Online Learning and Bandits
Keywords: Contextual Bandit, Multi-armed Bandit, Large Language Model, LLM Alignment, Decoding-time Alignment, Decoding Algorithm
Submission Number: 7775